Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
- URL: http://arxiv.org/abs/2601.10402v1
- Date: Thu, 15 Jan 2026 13:52:04 GMT
- Title: Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
- Authors: Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen
- Abstract summary: We present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE). By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC). HCC allows agents to decouple immediate execution from long-term experimental strategy. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%.
- Score: 59.18634614089481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.
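The tiered distillation idea behind HCC can be pictured as a small cache hierarchy: raw execution traces are volatile, get summarized into per-task knowledge when they overflow the working set, and recurring findings are promoted into cross-task wisdom. The sketch below is a toy illustration of that pattern only; the class, field names, and distillation rule are our assumptions, not the paper's actual implementation.

```python
from collections import deque


class HierarchicalCognitiveCache:
    """Toy three-tier cache: transient traces -> distilled knowledge -> cross-task wisdom.

    Illustrative sketch of the HCC pattern described in the abstract;
    names and the distillation rule are assumptions, not the paper's code.
    """

    def __init__(self, trace_capacity=4):
        self.traces = deque()               # tier 1: raw execution traces (volatile)
        self.knowledge = []                 # tier 2: distilled per-task findings
        self.wisdom = []                    # tier 3: stable cross-task strategy notes
        self.trace_capacity = trace_capacity

    def record(self, trace):
        """Append a raw trace; distill the oldest ones once over capacity."""
        self.traces.append(trace)
        while len(self.traces) > self.trace_capacity:
            self._distill(self.traces.popleft())

    def _distill(self, trace):
        # Stand-in for LLM summarization: keep only the task and its outcome,
        # dropping the verbose execution log.
        self.knowledge.append({"task": trace["task"], "outcome": trace["outcome"]})

    def consolidate(self):
        """Promote outcomes observed across two or more tasks into wisdom."""
        seen = {}
        for item in self.knowledge:
            seen.setdefault(item["outcome"], set()).add(item["task"])
        for outcome, tasks in seen.items():
            if len(tasks) >= 2 and outcome not in self.wisdom:
                self.wisdom.append(outcome)

    def context(self):
        """Working context = recent traces plus the stable tiers,
        never the full execution history."""
        return {"recent": list(self.traces),
                "knowledge": self.knowledge,
                "wisdom": self.wisdom}
```

The key design point this sketch tries to convey is that the agent's prompt context is rebuilt from `context()` rather than from an ever-growing transcript, which is how such a scheme sidesteps the scaling limits of a static context window.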
Related papers
- OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions [66.84396313837765]
We introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We provide a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. We also introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons.
arXiv Detail & Related papers (2026-02-05T16:31:43Z) - Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12 [0.1710384116816033]
Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12). We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions.
arXiv Detail & Related papers (2026-02-03T15:18:26Z) - Dynamic Intelligence Ceilings: Measuring Long-Horizon Limits of Planning and Creativity in Artificial Systems [0.0]
We argue that a central limitation of contemporary AI systems lies not in capability per se, but in the premature fixation of their performance frontier. We introduce the concept of a Dynamic Intelligence Ceiling (DIC), defined as the highest level of effective intelligence attainable by a system at a given time. We operationalize DIC using two estimators: the Difficulty Ceiling (PDC), which captures the maximal reliably solvable difficulty under constrained resources, and the Ceiling Drift Rate (CDR), which quantifies the temporal evolution of this frontier.
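The two estimators in the summary above admit a natural formalization; the notation below is our own illustrative reading of the abstract, not necessarily the paper's definitions:

```latex
% Difficulty Ceiling at time t under resource budget R (hypothetical form):
% the highest difficulty d solved with reliability at least 1 - \epsilon.
\mathrm{PDC}(t; R) = \max \bigl\{ d : \Pr[\text{solve}(d) \mid R, t] \ge 1 - \epsilon \bigr\}

% Ceiling Drift Rate: the temporal evolution of that frontier.
\mathrm{CDR}(t) = \frac{d}{dt}\, \mathrm{PDC}(t; R)
```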
arXiv Detail & Related papers (2026-01-03T00:13:45Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - ExpVid: A Benchmark for Experiment Video Understanding & Reasoning [65.17173232816818]
We introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning.
arXiv Detail & Related papers (2025-10-13T16:45:28Z) - Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks [42.78572295558531]
Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. We propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module.
arXiv Detail & Related papers (2025-10-09T09:40:34Z) - LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space [66.71318175695988]
Test-Time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. We propose LatentEvolve, a self-evolving latent TTS framework inspired by complementary learning systems theory.
arXiv Detail & Related papers (2025-09-29T13:37:39Z) - ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning [49.25518866694287]
We propose ML-Master, a novel AI4AI agent that seamlessly integrates exploration and reasoning by employing a selectively scoped memory mechanism. We evaluate ML-Master on MLE-Bench, where it achieves a 29.3% average medal rate, significantly surpassing existing methods.
arXiv Detail & Related papers (2025-06-19T17:53:28Z) - Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation Tasks [12.27904219271791]
Current reinforcement learning algorithms struggle in sparse and complex environments.
We propose the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework.
arXiv Detail & Related papers (2023-09-28T11:14:52Z) - Incremental procedural and sensorimotor learning in cognitive humanoid robots [52.77024349608834]
This work presents a cognitive agent that can learn procedures incrementally.
We show the cognitive functions required in each substage and how adding new functions helps address tasks previously unsolved by the agent.
Results show that this approach is capable of solving complex tasks incrementally.
arXiv Detail & Related papers (2023-04-30T22:51:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.