Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
- URL: http://arxiv.org/abs/2512.14895v1
- Date: Tue, 16 Dec 2025 20:19:07 GMT
- Title: Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections
- Authors: Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da,
- Abstract summary: A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology.
- Score: 8.286067243223204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from a fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift in multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems, training models with rejection sampling (i.e., filtering by environment reward) combined with supervised fine-tuning. OEC trajectories yield relative improvements of 14% and 13% over traditional imitation learning in the 7B and 32B settings, respectively, on SWE-bench Verified. Our results demonstrate the need to combine expert demonstrations with on-policy data for effective multi-turn LM agent training.
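To make the recipe in the abstract concrete, here is a minimal Python sketch of OEC data generation combined with rejection sampling. Everything here is an assumption for illustration: the agent/environment interface (`env.reset`, `env.step`, `policy.act`), the randomized switch point, and the success criterion (`reward > 0`) are hypothetical placeholders, not the paper's actual implementation.

```python
# Hedged sketch of on-policy expert correction (OEC) data generation,
# following only what the abstract states. All APIs below are hypothetical.
import random

def generate_oec_trajectory(env, student, expert, max_turns, switch_step):
    """Roll out the student policy, then hand control to the expert.

    Returns the trajectory and the final environment reward, so that
    trajectories can later be filtered by rejection sampling.
    """
    trajectory = []
    obs = env.reset()
    reward = 0.0
    for t in range(max_turns):
        # Student acts for the first `switch_step` turns; the expert
        # takes over from whatever state the student reached.
        policy = student if t < switch_step else expert
        action = policy.act(obs, history=trajectory)
        trajectory.append((obs, action))
        obs, reward, done = env.step(action)
        if done:
            break
    return trajectory, reward

def build_oec_dataset(env_factory, student, expert, n_rollouts, max_turns):
    """Rejection sampling: keep only trajectories that solve the task."""
    dataset = []
    for _ in range(n_rollouts):
        env = env_factory()
        # Randomizing the switch point is one plausible choice, not
        # specified in the abstract.
        switch = random.randint(1, max_turns - 1)
        traj, reward = generate_oec_trajectory(
            env, student, expert, max_turns, switch
        )
        if reward > 0:  # environment reward signals a successful fix
            dataset.append(traj)
    return dataset  # fine-tune the student on `dataset` with standard SFT
```

Randomizing `switch_step` is a DAgger-flavored way to cover expert corrections at varying depths of student divergence; the abstract only says the switch happens "part way through the trajectory."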
Related papers
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z) - On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting [91.38734024438357]
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
arXiv Detail & Related papers (2025-08-15T11:20:03Z) - Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking [61.61356842567952]
We propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps. Experiments demonstrate that our method improves agent performance across three representative tasks.
arXiv Detail & Related papers (2025-05-26T14:11:12Z) - AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing [64.79967583649407]
Knowledge Tracing (KT) monitors students' knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to significant error accumulation. We propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT), which focuses on the multi-step KT task.
arXiv Detail & Related papers (2025-04-07T03:31:57Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z) - Conditional Neural Expert Processes for Learning Movement Primitives from Demonstration [1.9336815376402723]
Conditional Neural Expert Processes (CNEP) learns to assign demonstrations from different modes to distinct expert networks.
CNEP does not require supervision on which mode the trajectories belong to.
Our system is capable of on-the-fly adaptation to environmental changes via an online conditioning mechanism.
arXiv Detail & Related papers (2024-02-13T12:52:02Z) - Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z) - Model Predictive Control via On-Policy Imitation Learning [28.96122879515294]
We develop new sample complexity results and performance guarantees for data-driven Model Predictive Control.
Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to theoretically bound the number of online MPC trajectories needed to achieve optimal performance.
arXiv Detail & Related papers (2022-10-17T16:06:06Z) - Imitation Learning from Observations under Transition Model Disparity [22.456737935789103]
Learning to perform tasks by leveraging a dataset of expert observations, known as imitation learning from observations (ILO), is an important paradigm for learning skills without access to the expert reward function or the expert actions.
Recent methods for scalable ILO utilize adversarial learning to match the state-transition distributions of the expert and the learner.
We propose an algorithm that trains an intermediary policy in the learner environment and uses it as a surrogate expert for the learner.
arXiv Detail & Related papers (2022-04-25T05:36:54Z) - SS-MAIL: Self-Supervised Multi-Agent Imitation Learning [18.283839252425803]
Two families of algorithms address this problem: Behavioral Cloning (BC) and Adversarial Imitation Learning (AIL).
BC approaches suffer from compounding errors, as they ignore the sequential decision-making nature of the trajectory generation problem.
AIL methods are plagued with instability in their training dynamics.
We introduce a novel self-supervised loss that encourages the discriminator to approximate a richer reward function.
arXiv Detail & Related papers (2021-10-18T01:17:50Z) - CoDE: Collocation for Demonstration Encoding [31.220899638271856]
We present a data-efficient imitation learning technique called Collocation for Demonstration Encoding (CoDE).
We circumvent problematic back-propagation through time by introducing an auxiliary trajectory, taking inspiration from collocation techniques in optimal control.
We present experiments on a 7-degree-of-freedom robotic manipulator learning behavior shaping policies for efficient tabletop operation.
arXiv Detail & Related papers (2021-05-07T00:34:43Z) - DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)