CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
- URL: http://arxiv.org/abs/2409.11363v1
- Date: Tue, 17 Sep 2024 17:13:19 GMT
- Title: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
- Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan,
- Abstract summary: CORE-Bench is a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine)
We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way.
The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
- Score: 11.794931453828974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
Related papers
- ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z) - ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [23.773528748933934]
We extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
We unify the target output for every task to a self-contained Python program file.
We propose two effective strategies to mitigate data contamination concerns.
arXiv Detail & Related papers (2024-10-07T14:33:50Z) - Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [117.94654815220404]
G"odel Agent is a self-evolving framework inspired by the G"odel machine.
G"odel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z) - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach to combine test-time search and self-learning to build o1-like models for agentic applications.
We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly.
Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
arXiv Detail & Related papers (2024-10-02T21:42:35Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - LAB-Bench: Measuring Capabilities of Language Models for Biology Research [1.6312096924271486]
We introduce the Language Agent Biology Benchmark (LAB-Bench)
It is a dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities.
We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
arXiv Detail & Related papers (2024-07-14T23:52:25Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics each with three levels of difficulty and several parametric variations.
We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents [19.439775106707344]
AgentQuest is a framework where benchmarks and metrics are modular and easily through well documented and easy-to-use APIs.
We offer two new evaluation metrics that can reliably track LLM agent progress while solving a task.
We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase.
arXiv Detail & Related papers (2024-04-09T16:01:24Z) - Harnessing Pre-trained Generalist Agents for Software Engineering Tasks [13.733085206098258]
Deep reinforcement learning (DRL) has been successfully used for automation in complex tasks such as game testing and solving the job-shop scheduling problem.
Specialist DRL agents suffer from a lack of generalizability to other tasks and need substantial time to be developed and re-trained effectively.
Recently, DRL researchers have begun to develop generalist agents, able to learn a policy from various environments and capable of achieving performances similar to or better than specialist agents in new tasks.
arXiv Detail & Related papers (2023-12-24T18:39:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.