CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
- URL: http://arxiv.org/abs/2409.11363v1
- Date: Tue, 17 Sep 2024 17:13:19 GMT
- Title: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
- Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan,
- Abstract summary: CORE-Bench is a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine)
We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way.
The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
- Score: 11.794931453828974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
Related papers
- ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences [19.81372090301296]
ReplicatorBench is an end-to-end benchmark for evaluating AI agents in research replication across three stages.<n>We develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments.<n>We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access.
arXiv Detail & Related papers (2026-02-11T20:42:10Z) - AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents [49.67355440164857]
We introduce AIRS-Bench, a suite of 20 tasks sourced from state-of-the-art machine learning papers.<n>Airs-Bench tasks assess agentic capabilities over the full research lifecycle.<n>We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
arXiv Detail & Related papers (2026-02-06T16:45:02Z) - ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? [29.17900668495058]
We introduce ReplicationBench, an evaluation framework for frontier AI agents.<n>It tests whether agents can replicate entire research papers drawn from the astrophysics literature.<n>R ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks.
arXiv Detail & Related papers (2025-10-28T16:21:19Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research.<n>Our suite comes with the first scientific research environment with production-grade search tools.<n>Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm [60.36837655498119]
We propose a Trajectory-based validated-by-Reproducing Agent-benchmark Complexity Evolution framework.<n>This framework takes an original task from an existing benchmark and encourages agents to evolve it into a new task with higher difficulty.<n>Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness.
arXiv Detail & Related papers (2025-10-01T01:52:52Z) - REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? [2.111102681327218]
Existing benchmarks for reproducing research papers focus solely on reproducing results using provided code and data.<n>We introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report.<n>We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%.
arXiv Detail & Related papers (2025-07-25T02:48:30Z) - Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design.<n>Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms.<n>We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z) - From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments.<n>We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases.<n>Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z) - Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study [15.97770416681533]
Software engineering agents (SWE agents) operate autonomously by interpreting user input and responding to environmental feedback.<n>We present the first systematic study of SWE agent behavior through the lens of execution traces.
arXiv Detail & Related papers (2025-06-10T00:41:54Z) - Self-Challenging Language Model Agents [98.62637336505242]
We propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself.<n>The framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.
arXiv Detail & Related papers (2025-06-02T14:23:33Z) - R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science [70.1638335489284]
High-level machine learning engineering tasks remain labor-intensive and iterative.<n>We introduce R&D-Agent, a comprehensive, decoupled, and framework that formalizes the machine learning process.<n>R&D-Agent defines the MLE into two phases and six components, turning agent design for MLE into a principled, testable process.
arXiv Detail & Related papers (2025-05-20T06:07:00Z) - ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies [16.90884865239373]
We introduce ResearchCodeAgent, a novel multi-agent system to automate the codification of research methodologies.
The system bridges the gap between high-level research concepts and their practical implementation.
ResearchCodeAgent represents a significant step towards the research implementation process, potentially accelerating the pace of machine learning research.
arXiv Detail & Related papers (2025-04-28T07:18:45Z) - MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks.
This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents.
We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro.
arXiv Detail & Related papers (2025-02-20T12:28:23Z) - ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z) - ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [23.773528748933934]
We extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
We unify the target output for every task to a self-contained Python program file.
We propose two effective strategies to mitigate data contamination concerns.
arXiv Detail & Related papers (2024-10-07T14:33:50Z) - Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [117.94654815220404]
G"odel Agent is a self-evolving framework inspired by the G"odel machine.
G"odel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z) - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach to combine test-time search and self-learning to build o1-like models for agentic applications.
We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly.
Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
arXiv Detail & Related papers (2024-10-02T21:42:35Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - LAB-Bench: Measuring Capabilities of Language Models for Biology Research [1.6312096924271486]
We introduce the Language Agent Biology Benchmark (LAB-Bench)
It is a dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities.
We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
arXiv Detail & Related papers (2024-07-14T23:52:25Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics each with three levels of difficulty and several parametric variations.
We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work.
ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them.
We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z) - AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents [19.439775106707344]
AgentQuest is a framework where benchmarks and metrics are modular and easily through well documented and easy-to-use APIs.
We offer two new evaluation metrics that can reliably track LLM agent progress while solving a task.
We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase.
arXiv Detail & Related papers (2024-04-09T16:01:24Z) - Harnessing Pre-trained Generalist Agents for Software Engineering Tasks [13.733085206098258]
Deep reinforcement learning (DRL) has been successfully used for automation in complex tasks such as game testing and solving the job-shop scheduling problem.
Specialist DRL agents suffer from a lack of generalizability to other tasks and need substantial time to be developed and re-trained effectively.
Recently, DRL researchers have begun to develop generalist agents, able to learn a policy from various environments and capable of achieving performances similar to or better than specialist agents in new tasks.
arXiv Detail & Related papers (2023-12-24T18:39:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.