https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/

https://metr.org/AI_R_D_Evaluation_Report.pdf

The report, titled "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts," introduces RE-Bench, a benchmark that directly compares the AI R&D capabilities of frontier language model agents against those of human experts, motivated by the question of whether AI agents could automate the work of expert AI researchers.

Key Contributions:

- RE-Bench: seven challenging, open-ended ML research engineering environments, each with an objective scoring function.
- A human expert baseline built from 71 eight-hour attempts by 61 distinct experts.
- Open-sourced evaluation environments, human expert data, analysis code, and agent trajectories.

Findings:

- Given a short time budget of 2 hours per environment, the best agents (built on frontier models such as Claude 3.5 Sonnet and o1-preview) score higher than human experts.
- Humans show better returns to additional time, narrowly surpassing agents with an 8-hour budget and roughly doubling the best agent scores at 32 hours.
- Agent solutions are substantially cheaper and faster to generate than human ones.

Conclusion:

The study suggests that while AI agents exhibit impressive capabilities in automating aspects of AI research and development, human experts currently maintain an edge when given extended time budgets. The open-sourcing of the evaluation environments, human expert data, analysis code, and agent trajectories is intended to facilitate future research in this domain.