FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
- URL: http://arxiv.org/abs/2510.10472v1
- Date: Sun, 12 Oct 2025 06:41:05 GMT
- Title: FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
- Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu,
- Abstract summary: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents.<n>Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor.<n>We introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems.
- Score: 43.606494515048524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators.<n>We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent.<n>Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z) - FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings.<n>Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - The Role of AI in Modern Penetration Testing [0.0]
Penetration testing is a cornerstone of cybersecurity, traditionally driven by manual, time-intensive processes.<n>This systematic literature review examines how Artificial Intelligence (AI) is reshaping penetration testing.
arXiv Detail & Related papers (2025-12-13T13:34:31Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations.<n>An Experiment Manager orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback.<n>Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research.<n>Our suite comes with the first scientific research environment with production-grade search tools.<n>Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection [55.125987985864896]
We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors.<n>We propose a simple yet effective approach to instantiate a search agent, RE-Searcher.<n>This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments.
arXiv Detail & Related papers (2025-09-30T10:25:27Z) - Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025 [1.6819960041696331]
RAG and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs.<n>Applying these systems to domain-specific professional search, such as biomedical research, presents challenges.<n>We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback.
arXiv Detail & Related papers (2025-08-07T13:13:19Z) - MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research [70.72318131988102]
MLR-Bench is a comprehensive benchmark for evaluating AI agents on open-ended machine learning research.<n>MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing.
arXiv Detail & Related papers (2025-05-26T13:18:37Z) - MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [64.62421656031128]
MLRC-Bench is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions.<n>Unlike prior work, MLRC-Bench measures the key steps of proposing and implementing novel research methods.<n>Even the best-performing tested agent closes only 9.3% of the gap between baseline and top human participant scores.
arXiv Detail & Related papers (2025-04-13T19:35:43Z) - Autonomous Microscopy Experiments through Large Language Model Agents [4.241267255764773]
Large language models (LLMs) are revolutionizing self driving laboratories (SDLs) for materials research.<n>We introduce Artificially Intelligent Lab Assistant (AILA), a framework automating atomic force microscopy through LLM driven agents.<n>We find that state of the art models struggle with basic tasks and coordination scenarios.
arXiv Detail & Related papers (2024-12-18T09:35:28Z) - ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z) - MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents [10.86017322488788]
We present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot)
It is designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents.
We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate the research progress and innovations.
arXiv Detail & Related papers (2024-08-26T05:55:48Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.