RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
- URL: http://arxiv.org/abs/2411.15114v1
- Date: Fri, 22 Nov 2024 18:30:46 GMT
- Title: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
- Authors: Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes,
- Abstract summary: We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts.
We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.
Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
- Score: 4.112091541691995
- License:
- Abstract: Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts' -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.
Related papers
- ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z) - Raising the Stakes: Performance Pressure Improves AI-Assisted Decision Making [57.53469908423318]
We show the effects of performance pressure on AI advice reliance when laypeople complete a common AI-assisted task.
We find that when the stakes are high, people use AI advice more appropriately than when stakes are lower, regardless of the presence of an AI explanation.
arXiv Detail & Related papers (2024-10-21T22:39:52Z) - H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark [7.840781070208872]
Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
arXiv Detail & Related papers (2024-09-02T17:11:32Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI)
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z) - Bending the Automation Bias Curve: A Study of Human and AI-based
Decision Making in National Security Contexts [0.0]
We theorize about the relationship between background knowledge about AI, trust in AI, and how these interact with other factors to influence the probability of automation bias.
We test these in a preregistered task identification experiment across a representative sample of 9000 adults in 9 countries with varying levels of AI industries.
arXiv Detail & Related papers (2023-06-28T18:57:36Z) - BO-Muse: A human expert and AI teaming framework for accelerated
experimental design [58.61002520273518]
Our algorithm lets the human expert take the lead in the experimental process.
We show that our algorithm converges sub-linearly, at a rate faster than the AI or human alone.
arXiv Detail & Related papers (2023-03-03T02:56:05Z) - Measuring an artificial intelligence agent's trust in humans using
machine incentives [2.1016374925364616]
Gauging an AI agent's trust in humans is challenging because dishonesty might respond falsely about their trust in humans.
We present a method for incentivizing machine decisions without altering an AI agent's underlying algorithms or goal orientation.
Our experiments suggest that one of the most advanced AI language models to date alters its social behavior in response to incentives.
arXiv Detail & Related papers (2022-12-27T06:05:49Z) - Should I Follow AI-based Advice? Measuring Appropriate Reliance in
Human-AI Decision-Making [0.0]
We aim to enable humans not to rely on AI advice blindly but rather to distinguish its quality and act upon it to make better decisions.
Current research lacks a metric for appropriate reliance (AR) on AI advice on a case-by-case basis.
We propose to view AR as a two-dimensional construct that measures the ability to discriminate advice quality and behave accordingly.
arXiv Detail & Related papers (2022-04-14T12:18:51Z) - Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork [54.309495231017344]
We argue that AI systems should be trained in a human-centered manner, directly optimized for team performance.
We study this proposal for a specific type of human-AI teaming, where the human overseer chooses to either accept the AI recommendation or solve the task themselves.
Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the most accuracy AI may not lead to highest team performance.
arXiv Detail & Related papers (2020-04-27T19:06:28Z) - Effect of Confidence and Explanation on Accuracy and Trust Calibration
in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.