Reinforcement Learning for Machine Learning Engineering Agents
- URL: http://arxiv.org/abs/2509.01684v1
- Date: Mon, 01 Sep 2025 18:04:10 GMT
- Title: Reinforcement Learning for Machine Learning Engineering Agents
- Authors: Sherry Yang, Joy He-Yueya, Percy Liang,
- Abstract summary: We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger, but static models.<n>We propose duration- aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions.<n>We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
- Score: 52.03168614623642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger, but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration- aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as a reward provides limited feedback. A program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static language model to insert print statement to an existing program to log the agent's experimental progress, from which partial credit can be extracted as reward signals for learning. Our experimental results on MLEBench suggest that performing gradient updates on a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds, by an average of 22% across 12 Kaggle tasks.
Related papers
- SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning [3.1436750864792375]
We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink.<n>SimuAgent replaces XML with a concise, dictionary-style Python representation, dramatically cutting token counts.<n>A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning.
arXiv Detail & Related papers (2026-01-08T18:10:35Z) - DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.<n>Existing methods suffer from computational inefficiency and objective mismatches between training and inference.<n>We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z) - Reinforced Language Models for Sequential Decision Making [6.971286730860635]
Large Language Models (LLMs) show potential as sequential decision-making agents.<n>Existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks.<n>This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents.
arXiv Detail & Related papers (2025-08-14T17:05:44Z) - Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today.<n>Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL.<n>Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [55.044159987218436]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments.<n>We take a first step toward exploring the early-exit behavior for LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z) - Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
Generalist Virtual Agents (GVAs) have shown significant promise in autonomous task execution.<n>To address these challenges, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model.<n>Similar offers fine-grained signals for agent training and can choose better action for inference-time scaling.
arXiv Detail & Related papers (2025-03-24T13:30:47Z) - VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought [41.72701516732208]
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot learning but require high-quality demonstrations.<n>We propose In-Context Abstraction Learning (ICAL), enabling VLM agents to transform suboptimal trajectories into high-quality training data.
arXiv Detail & Related papers (2024-06-20T17:45:02Z) - Solving Offline Reinforcement Learning with Decision Tree Regression [0.0]
This study presents a novel approach to addressing offline reinforcement learning problems by reframing them as regression tasks.
We introduce two distinct frameworks: return-conditioned and return-weighted decision tree policies.
Despite the simplification inherent in this reformulated approach to offline RL, our agents demonstrate performance that is at least on par with the established methods.
arXiv Detail & Related papers (2024-01-21T23:50:46Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learneds do not work well even in simple RL tasks.
Agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training.
We show that, although only trained in toy tasks, our learned can generalize unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.