FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
- URL: http://arxiv.org/abs/2602.02905v1
- Date: Mon, 02 Feb 2026 23:21:13 GMT
- Title: FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
- Authors: Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu
- Abstract summary: We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
- Score: 63.32178443510396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
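Since the benchmark scores agents with an F1-style rediscovery metric, a minimal sketch may help make that concrete. The sketch below is an assumption, not the paper's actual protocol: it treats an agent's run as a set of reported findings, matches them against the source study's verified findings with a placeholder `matches` predicate, and computes precision, recall, and F1. All names here (`Finding`, `matches`, `rediscovery_f1`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A single empirical claim, e.g. 'method A beats method B on task T'. Hypothetical schema."""
    claim: str


def matches(predicted: Finding, reference: Finding) -> bool:
    # Placeholder matcher: exact match on normalized claim strings.
    # A real benchmark would plausibly use structured claim comparison or an LLM judge.
    return predicted.claim.strip().lower() == reference.claim.strip().lower()


def rediscovery_f1(predicted: list[Finding], reference: list[Finding]) -> float:
    """F1 between an agent's findings and a study's verified findings (illustrative only)."""
    if not predicted or not reference:
        return 0.0
    # Note: greedy many-to-many matching; a stricter protocol would enforce one-to-one alignment.
    tp = sum(any(matches(p, r) for r in reference) for p in predicted)
    precision = tp / len(predicted)
    recall = sum(any(matches(p, r) for p in predicted) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [Finding("Chain-of-thought improves accuracy on GSM8K")]
    ref = [Finding("chain-of-thought improves accuracy on gsm8k")]
    print(rediscovery_f1(pred, ref))  # 1.0
```

In practice the matching step is the hard part: string equality is only a stand-in, and the high run-to-run variance the abstract reports would show up here as different finding sets across runs of the same task.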
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research outputs, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and paves the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z) - HeurekaBench: A Benchmarking Framework for AI Co-scientist [2.206319727896241]
HeurekaBench is a framework for creating benchmarks with exploratory, open-ended research questions for experimental datasets. We instantiate the framework in single-cell biology to obtain the sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We find that the addition of a critic module can improve ill-formed responses for open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts.
arXiv Detail & Related papers (2026-01-04T22:16:42Z) - Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z) - Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z) - Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent [52.876617746453995]
Dr.Mi-Bench is a modular-integrated benchmark for scientific deep research (DR) agents. Dr.Mi-Eval is a novel modular-integrated evaluation paradigm.
arXiv Detail & Related papers (2025-11-30T17:16:47Z) - InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research [36.46396692622759]
We introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction. We also develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving.
arXiv Detail & Related papers (2025-10-31T16:22:23Z) - FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth [43.606494515048524]
Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor. We introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems.
arXiv Detail & Related papers (2025-10-12T06:41:05Z) - Hypothesis Hunting with Evolving Networks of Autonomous Scientific Agents [52.50038914857797]
We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. We introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier.
arXiv Detail & Related papers (2025-10-08T08:47:07Z) - MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research [70.72318131988102]
MLR-Bench is a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. (A minimal, hypothetical sketch of such a staged scaffold follows this list.)
arXiv Detail & Related papers (2025-05-26T13:18:37Z)
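Several entries above share a common architecture: a modular scaffold that threads a research task through discrete stages (for MLR-Agent: idea generation, proposal formulation, experimentation, and paper writing). Below is a minimal, hypothetical sketch of that pattern; the stage names follow the MLR-Bench summary, but every function signature and implementation detail is assumed rather than drawn from any of the papers.

```python
from typing import Callable

# Hypothetical staged scaffold: each stage consumes the previous stage's artifact.
# All interfaces here are illustrative assumptions, not the papers' actual APIs.
Stage = Callable[[str], str]


def run_pipeline(task: str, stages: list[Stage]) -> str:
    """Thread the task through each stage in order, returning the final artifact."""
    artifact = task
    for stage in stages:
        artifact = stage(artifact)
    return artifact


# Stub stages; a real scaffold would call an LLM and execute generated code here.
def generate_idea(task: str) -> str:
    return f"idea for: {task}"


def formulate_proposal(idea: str) -> str:
    return f"proposal based on ({idea})"


def run_experiments(proposal: str) -> str:
    return f"results of ({proposal})"


def write_paper(results: str) -> str:
    return f"paper reporting ({results})"


if __name__ == "__main__":
    report = run_pipeline(
        "reduce variance in agent rediscovery runs",
        [generate_idea, formulate_proposal, run_experiments, write_paper],
    )
    print(report)
```

The design choice worth noting is that each stage sees only the previous stage's output, which keeps stages swappable for ablation; real scaffolds typically also carry shared state (code, logs, datasets) across stages.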
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.