AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
- URL: http://arxiv.org/abs/2507.02554v1
- Date: Thu, 03 Jul 2025 11:59:15 GMT
- Title: AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
- Authors: Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach,
- Abstract summary: We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators.<n>Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%.
- Score: 65.21702462691933
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
Related papers
- EvolveSearch: An Iterative Self-Evolving Search Agent [98.18686493123785]
Large language models (LLMs) have transformed agentic information seeking capabilities through the integration of tools such as search engines and web browsers.<n>We propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data.
arXiv Detail & Related papers (2025-05-28T15:50:48Z) - MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement [73.34265922786763]
We propose MLE-STAR, a novel approach to build machine learning agents.<n>MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web.<n>We introduce a novel ensembling method using an effective strategy suggested by MLE-STAR.
arXiv Detail & Related papers (2025-05-27T18:11:25Z) - ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies [16.90884865239373]
We introduce ResearchCodeAgent, a novel multi-agent system to automate the codification of research methodologies.<n>The system bridges the gap between high-level research concepts and their practical implementation.<n>ResearchCodeAgent represents a significant step towards the research implementation process, potentially accelerating the pace of machine learning research.
arXiv Detail & Related papers (2025-04-28T07:18:45Z) - MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks.<n>This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents.<n>We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro.
arXiv Detail & Related papers (2025-02-20T12:28:23Z) - ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks.
This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o.
The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z) - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach to combine test-time search and self-learning to build o1-like models for agentic applications.<n>We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly.<n>Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
arXiv Detail & Related papers (2024-10-02T21:42:35Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - Boosting Search Engines with Interactive Agents [25.89284695491093]
This paper presents first steps in designing agents that learn meta-strategies for contextual query refinements.
Agents are empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results.
arXiv Detail & Related papers (2021-09-01T13:11:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.