Measuring AI Ability to Complete Long Tasks
- URL: http://arxiv.org/abs/2503.14499v1
- Date: Tue, 18 Mar 2025 17:59:31 GMT
- Title: Measuring AI Ability to Complete Long Tasks
- Authors: Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
- Abstract summary: We measure the time humans typically take to complete tasks that AI models can complete with 50% success rate. Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. The increase in AI models' time horizons seems to be driven by greater reliability and ability to adapt to mistakes.
- Score: 5.986082428339293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
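The 50% time horizon lends itself to a small worked example. The sketch below fits a logistic curve of AI success probability against log task length and reads off the length at which predicted success crosses 50%; the task data, the logistic-in-log-time model family, and the fitting choices are illustrative assumptions, not the authors' exact methodology.

```python
# Minimal sketch of a 50%-task-completion time horizon estimate.
# All data below are hypothetical; the logistic-in-log-time model is an
# assumed functional form, not necessarily the paper's exact procedure.
import numpy as np
from scipy.optimize import minimize

# Hypothetical tasks: how long each takes a human (minutes), and whether
# the AI agent succeeded on it.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
ai_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

def neg_log_likelihood(params):
    a, b = params
    # P(success) modeled as logistic in log2(human task time).
    logit = a - b * np.log2(task_minutes)
    p = np.clip(1.0 / (1.0 + np.exp(-logit)), 1e-9, 1 - 1e-9)
    return -np.sum(ai_success * np.log(p) + (1 - ai_success) * np.log(1 - p))

a, b = minimize(neg_log_likelihood, x0=[1.0, 1.0]).x
# The 50% horizon is the task length at which the fitted logit crosses zero.
print(f"50% time horizon: {2 ** (a / b):.0f} minutes")
```

The abstract's five-year extrapolation follows from the same numbers: going from a 50-minute horizon to a one-work-month horizon (taking a working month as roughly 167 hours, an assumed convention) requires log2(167 * 60 / 50) ≈ 7.6 doublings, and at one doubling every seven months that is about 53 months, i.e. under five years.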
Related papers
- AGI Is Coming... Right After AI Learns to Play Wordle [4.2909314120969855]
We examine multimodal agents, in particular OpenAI's Computer-User Agent (CUA), trained to control and complete tasks through a standard computer interface, much as humans do.
We evaluated the agent's performance on the New York Times Wordle game to elicit model behaviors and identify shortcomings.
arXiv Detail & Related papers (2025-04-21T20:58:58Z) - HCAST: Human-Calibrated Autonomy Software Tasks [1.5287939112540956]
We present HCAST, a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks.
We estimate that HCAST tasks take humans between one minute and 8+ hours.
We evaluate the success rates of AI agents built on frontier foundation models.
arXiv Detail & Related papers (2025-03-21T17:54:01Z) - General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z) - Evaluating Intelligence via Trial and Error [59.80426744891971]
We introduce the Survival Game, a framework that evaluates intelligence by the number of failed attempts in a trial-and-error process. When the expectation and variance of failure counts are both finite, this signals the ability to consistently find solutions to new challenges. Our results show that while AI systems achieve the Autonomous Level on simple tasks, they remain far from it on more complex tasks.
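To make the moment criterion concrete: a minimal sketch, assuming each attempt succeeds independently with a fixed probability p (a placeholder simulator, not the paper's setup). Under that assumption, failure counts are geometric, with mean (1-p)/p and variance (1-p)/p², both finite whenever p > 0.

```python
# Minimal sketch of failure-count statistics under an assumed
# independent-attempt model (illustrative, not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # assumed per-attempt success probability

# rng.geometric draws the number of trials up to and including the first
# success, so failures before success = draws - 1.
failures = rng.geometric(p, size=100_000) - 1

print("mean failures:", failures.mean())  # ~ (1-p)/p   = 2.33
print("var  failures:", failures.var())   # ~ (1-p)/p^2 = 7.78
```

Under this simple model, both moments blow up as the per-attempt success probability decays toward zero on harder tasks, which is the failure mode a finite-moment criterion would flag.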
arXiv Detail & Related papers (2025-02-26T05:59:45Z) - TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company environment.
We find that with the most competitive agent, 24% of the tasks can be completed autonomously.
This paints a nuanced picture of task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z) - RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts [4.112091541691995]
We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts.
We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.
Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
arXiv Detail & Related papers (2024-11-22T18:30:46Z) - Generative Diffusion-based Contract Design for Efficient AI Twins Migration in Vehicular Embodied AI Networks [55.15079732226397]
Embodied AI is a rapidly advancing field that bridges the gap between cyberspace and physical space.
In VEANET, embodied AI twins act as in-vehicle AI assistants to perform diverse tasks supporting autonomous driving.
arXiv Detail & Related papers (2024-10-02T02:20:42Z) - Towards the Terminator Economy: Assessing Job Exposure to AI through LLMs [10.844598404826355]
One-third of U.S. employment is highly exposed to AI, primarily in high-skill jobs.
This exposure correlates positively with employment and wage growth from 2019 to 2023.
arXiv Detail & Related papers (2024-07-27T08:14:18Z) - Work-in-Progress: Crash Course: Can (Under Attack) Autonomous Driving Beat Human Drivers? [60.51287814584477]
This paper evaluates the inherent risks in autonomous driving by examining the current landscape of AVs.
We develop specific claims highlighting the delicate balance between the advantages of AVs and potential security challenges in real-world scenarios.
arXiv Detail & Related papers (2024-05-14T09:42:21Z) - Thousands of AI Authors on the Future of AI [1.0717301750064765]
Most respondents expressed substantial uncertainty about the long-term value of AI progress.
More than half suggested that "substantial" or "extreme" concern is warranted about six different AI-related scenarios.
There was disagreement about whether faster or slower AI progress would be better for the future of humanity.
arXiv Detail & Related papers (2024-01-05T14:53:09Z) - Fairness in AI and Its Long-Term Implications on Society [68.8204255655161]
We take a closer look at AI fairness and analyze how a lack of fairness can deepen biases over time.
We discuss how biased models can lead to more negative real-world outcomes for certain groups.
If these issues persist, they could be reinforced by interactions with other risks and have severe implications for society in the form of social unrest.
arXiv Detail & Related papers (2023-04-16T11:22:59Z) - Hybrid Intelligence [4.508830262248694]
We argue that the most likely paradigm for the division of labor between humans and machines in the next decades is Hybrid Intelligence.
This concept aims to combine the complementary strengths of human intelligence and AI so that together they can perform better than either could alone.
arXiv Detail & Related papers (2021-05-03T08:56:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.