HCAST: Human-Calibrated Autonomy Software Tasks
- URL: http://arxiv.org/abs/2503.17354v1
- Date: Fri, 21 Mar 2025 17:54:01 GMT
- Title: HCAST: Human-Calibrated Autonomy Software Tasks
- Authors: David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O'Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, Brian Goodrich, Max Hasin, Sami Jawhar, Megan Kinniment, Thomas Kwa, Aron Lajko, Nate Rush, Lucas Jun Koba Sato, Sydney Von Arx, Ben West, Lawrence Chan, Elizabeth Barnes
- Abstract summary: We present HCAST, a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks. We estimate that HCAST tasks take humans between one minute and 8+ hours. We evaluate the success rates of AI agents built on frontier foundation models.
- Score: 1.5287939112540956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect AI performance to real-world effects we care about. We present HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks. We collect 563 human baselines (totaling over 1500 hours) from people skilled in these domains, working under identical conditions as AI agents, which lets us estimate that HCAST tasks take humans between one minute and 8+ hours. Measuring the time tasks take for humans provides an intuitive metric for evaluating AI capabilities, helping answer the question "can an agent be trusted to complete a task that would take a human X hours?" We evaluate the success rates of AI agents built on frontier foundation models, and we find that current agents succeed 70-80% of the time on tasks that take humans less than one hour, and less than 20% of the time on tasks that take humans more than 4 hours.
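The headline numbers above come from grouping tasks by their estimated human completion time and measuring agent success within each group. Below is a minimal sketch of that style of analysis in Python; the `TaskResult` records, field names, and time buckets are illustrative assumptions, not HCAST's actual data format or analysis code.

```python
# Minimal sketch: agent success rate bucketed by human baseline time.
# Records and bucket boundaries are hypothetical, not HCAST's released data.
from dataclasses import dataclass

@dataclass
class TaskResult:
    human_minutes: float   # estimated human completion time for the task
    agent_succeeded: bool  # whether the agent completed the task

def success_rate_by_bucket(results, buckets=((0, 60), (60, 240), (240, float("inf")))):
    """Return agent success rate per human-time bucket (minutes, half-open intervals)."""
    rates = {}
    for lo, hi in buckets:
        in_bucket = [r for r in results if lo <= r.human_minutes < hi]
        if in_bucket:
            rates[(lo, hi)] = sum(r.agent_succeeded for r in in_bucket) / len(in_bucket)
    return rates

# Hypothetical usage with made-up results:
results = [TaskResult(5, True), TaskResult(45, True), TaskResult(90, False), TaskResult(300, False)]
print(success_rate_by_bucket(results))
```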
Related papers
- Measuring AI Ability to Complete Long Tasks [5.986082428339293]
We measure the time humans typically take to complete tasks that AI models can complete with 50% success rate. Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. The increase in AI models' time horizons seems to be driven by greater reliability and ability to adapt to mistakes.
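The 50% time horizon mentioned here is typically read off a fitted curve of success probability against (log) human task time: the horizon is the task length at which predicted success drops to 50%. A rough sketch of that idea with a simple logistic fit follows; the data points and the exact fitting procedure are illustrative assumptions, not the paper's code.

```python
# Rough sketch: estimate a "50% time horizon" from (human time, success) pairs.
# The data below is made up for illustration; this is not the paper's estimation code.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([2, 5, 10, 20, 40, 60, 120, 240, 480])  # per-task human baseline times
succeeded     = np.array([1, 1,  1,  1,  1,  0,   1,   0,   0])  # agent success on each task

# Fit P(success) = sigmoid(w * log(t) + b).
X = np.log(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where w * log(t) + b = 0, i.e. t = exp(-b / w).
w, b = model.coef_[0][0], model.intercept_[0]
print(f"Estimated 50% time horizon: {np.exp(-b / w):.0f} minutes")
```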
arXiv Detail & Related papers (2025-03-18T17:59:31Z)
- Evaluating Intelligence via Trial and Error [59.80426744891971]
We introduce Survival Game as a framework to evaluate intelligence based on the number of failed attempts in a trial-and-error process. When the expectation and variance of failure counts are both finite, it signals the ability to consistently find solutions to new challenges. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks.
arXiv Detail & Related papers (2025-02-26T05:59:45Z)
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company. We find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
- RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts [4.112091541691995]
We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts.
We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.
Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
arXiv Detail & Related papers (2024-11-22T18:30:46Z)
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR).
We employ a semi-automated task generation pipeline using Large Language Models (LLMs).
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
- Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations.
arXiv Detail & Related papers (2023-03-02T18:51:38Z)
- Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap [45.6806234490428]
We benchmark current AIs in their abilities to imitate humans in three language tasks and three vision tasks.
Experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges.
Results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation [9.456752543341464]
A key challenge on the path to developing agents that learn complex human-like behavior is the need to quickly and accurately quantify human-likeness.
We address this challenge with a novel automated Navigation Turing Test (ANTT) that learns to predict human judgments of human-likeness.
arXiv Detail & Related papers (2021-05-20T10:14:23Z)
- Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents.
In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently.
We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark that includes both planning-based and learning-based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)