EXP-Bench: Can AI Conduct AI Research Experiments?
- URL: http://arxiv.org/abs/2505.24785v2
- Date: Mon, 02 Jun 2025 01:59:50 GMT
- Title: EXP-Bench: Can AI Conduct AI Research Experiments?
- Authors: Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, Ang Chen
- Abstract summary: We introduce EXP-Bench, a novel benchmark designed to evaluate AI agents on complete research experiments. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. Using a semi-autonomous curation pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers.
- Score: 38.30861763360086
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
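The abstract describes each task as a research question paired with incomplete starter code, and grades an agent separately on experiment design, implementation, execution, and analysis, with only 0.5% of attempts succeeding end to end. The sketch below is purely a hypothetical illustration of how such a task record and its aspect-wise grading could be represented in Python; the class names, fields, and the strict all-or-nothing success criterion are assumptions made for illustration, not the schema used in the EXP-Bench repository.

```python
# Hypothetical sketch of an EXP-Bench-style task record and aspect-wise grading.
# Field names and the success criterion are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class ExperimentTask:
    """One research experiment derived from a paper and its open-source code."""
    paper_id: str                  # e.g., an arXiv identifier
    research_question: str         # the question the agent must answer
    starter_code_dir: str          # path to the incomplete starter repository
    expected_procedure: list[str]  # ground-truth step-by-step experiment plan


@dataclass
class AgentAttempt:
    """What an agent produced for a task, scored along separate aspects."""
    design_correct: float          # fraction of design elements matched (0-1)
    implementation_correct: float  # fraction of implementation checks passed (0-1)
    executed_successfully: bool    # did the produced experiment run end to end?
    conclusion_correct: bool       # does the analysis answer the research question?

    def complete_success(self) -> bool:
        # The abstract's headline number (0.5%) counts only complete, executable
        # experiments; this mirrors that strict criterion.
        return (
            self.design_correct == 1.0
            and self.implementation_correct == 1.0
            and self.executed_successfully
            and self.conclusion_correct
        )


if __name__ == "__main__":
    attempt = AgentAttempt(0.3, 0.25, False, False)
    print(attempt.complete_success())  # False: partial aspect scores, no full run
```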
Related papers
- AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research [34.173947968362675]
AblationBench is a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section, and ReviewerAblation, which helps reviewers find missing ablations in a full paper. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework.
arXiv Detail & Related papers (2025-07-09T12:07:38Z)
- SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam? [51.112225746095746]
We introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers. X-Masters sets a new state-of-the-art record on Humanity's Last Exam with a score of 32.1%.
arXiv Detail & Related papers (2025-07-07T17:50:52Z)
- RExBench: Can coding agents autonomously implement AI research extensions? [14.147417159347448]
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. We argue that research extension and its implementation is a critical capability for such systems. We introduce RExBench to support the evaluation of this capability.
arXiv Detail & Related papers (2025-06-27T19:41:41Z)
- From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments. We evaluate state-of-the-art agents and find that performance degrades rapidly as the amount of masked code $n$ increases. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z)
- Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents [21.001278669360346]
We propose Curie, an AI framework designed to embed rigor into the experimentation process. Curie includes an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions.
arXiv Detail & Related papers (2025-02-22T03:58:19Z)
- Autonomous Microscopy Experiments through Large Language Model Agents [4.241267255764773]
Large language models (LLMs) have accelerated the development of self-driving laboratories (SDLs) for materials research. Here, we introduce AILA (Artificially Intelligent Lab Assistant), a framework that automates atomic force microscopy (AFM) through LLM-driven agents. Our systematic assessment shows that state-of-the-art language models struggle even with basic tasks such as documentation retrieval.
arXiv Detail & Related papers (2024-12-18T09:35:28Z)
- ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z)
- DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
- "Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently.
We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z)
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- PyExperimenter: Easily distribute experiments and track results [63.871474825689134]
PyExperimenter is a tool to facilitate the setup, documentation, execution, and subsequent evaluation of results from an empirical study of algorithms.
It is intended to be used by researchers in the field of artificial intelligence, but is not limited to this field.
arXiv Detail & Related papers (2023-01-16T10:43:02Z)