https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/

https://metr.org/AI_R_D_Evaluation_Report.pdf

The report, titled "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts," introduces RE-Bench, a benchmark that directly compares the AI R&D capabilities of frontier language model agents against those of human experts, motivated by the question of whether AI agents could automate the work of expert AI researchers.

Key Contributions:

- RE-Bench: seven challenging, open-ended ML research engineering environments, each with an objective scoring function.
- A human expert baseline built from 71 eight-hour attempts by 61 distinct experts.
- Open-sourced evaluation environments, human expert data, analysis code, and agent trajectories.

Findings:

- Given a short time budget of 2 hours per environment, the best agents (built on frontier models such as Claude 3.5 Sonnet and o1-preview) score higher than human experts.
- Humans show better returns to additional time, narrowly surpassing agents with an 8-hour budget and roughly doubling the best agent scores at 32 hours.
- Agent solutions are substantially cheaper and faster to generate than human ones.

Conclusion:

The study suggests that while AI agents exhibit impressive capabilities in automating aspects of AI research and development, human experts currently maintain an edge when given extended time budgets. The open-sourcing of the evaluation environments, human expert data, analysis code, and agent trajectories is intended to facilitate future research in this domain.