Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
- URL: http://arxiv.org/abs/2509.22601v2
- Date: Thu, 09 Oct 2025 04:27:07 GMT
- Title: Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
- Authors: Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
- Abstract summary: We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift.
- Score: 41.90621652673528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, in which a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, an auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations, such as clipping tokens with high covariance between probability and advantage, are introduced for trajectory-level entropy control to curb over-confidence.
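The abstract names three concrete mechanisms: a replay buffer of self-generated promising trajectories for off-policy self-imitation, recalibration of stored advantages against policy drift, and clipping of tokens whose probability covaries strongly with their advantage. The sketch below illustrates how those pieces could fit together; every name, signature, and hyperparameter (ReplayBuffer, sil_loss, sil_weight, clip_eps, kappa) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the SIL mechanisms the abstract describes. All names,
# signatures, and hyperparameters here are hypothetical.
import torch

class ReplayBuffer:
    """Stores self-generated promising trajectories for off-policy SIL updates."""
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.items = []  # (per-token log-probs under behavior policy, advantage)

    def add(self, old_logprobs: torch.Tensor, advantage: float):
        if advantage > 0:  # vanilla SIL keeps only better-than-baseline rollouts
            self.items.append((old_logprobs, advantage))
            self.items = self.items[-self.capacity:]

def sil_loss(new_logprobs, old_logprobs, advantage, clip_eps=0.2, kappa=2.0):
    """Self-imitation loss for one replayed trajectory.

    new_logprobs/old_logprobs: per-token log-probs under the current/behavior
    policy; advantage: scalar trajectory advantage stored in the buffer.
    """
    # Recalibrate the stored advantage with a clipped importance ratio so the
    # off-policy experience stays consistent with the drifted current policy.
    ratio = (new_logprobs - old_logprobs).exp().detach()
    adv = advantage * ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    # Drop tokens whose probability-advantage covariance is extreme, a rough
    # stand-in for the over-confidence clipping the abstract mentions.
    cov = (new_logprobs - new_logprobs.mean()) * (adv - adv.mean())
    keep = (cov < kappa * cov.std().clamp_min(1e-8)).float()
    return -(keep * new_logprobs * adv.clamp(min=0.0)).mean()

def sil_weight(step: int, total_steps: int) -> float:
    # Curriculum: intrinsic tool-call rewards dominate early; the SIL term
    # ramps up so exploitation strengthens as training progresses.
    return min(1.0, step / max(1, int(0.5 * total_steps)))
```

In a full trainer, sil_weight(step, total_steps) would scale this off-policy term alongside the on-policy objective and the auxiliary tool-call reward, mirroring the paper's shift from reward-driven skill exploration early on to experience-driven exploitation later.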
Related papers
- RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization [29.421185758698908]
Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. We propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.
arXiv Detail & Related papers (2026-03-03T15:23:42Z) - Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - Experiential Reinforcement Learning [22.545003569634982]
Experiential Reinforcement Learning (ERL) is a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
arXiv Detail & Related papers (2026-02-15T01:23:48Z) - EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning [15.529826552402769]
Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms (a generic sketch of the entropy-bonus idea appears after this list).
arXiv Detail & Related papers (2025-09-26T16:51:44Z) - Online Process Reward Learning for Agentic Reinforcement Learning [92.26560379363492]
Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation. We introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL.
arXiv Detail & Related papers (2025-09-23T16:15:42Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
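Several entries above, EPO in particular, revolve around entropy regularization, the same knob SPEAR seeks to control. The summary does not detail EPO's three mechanisms, so the sketch below only shows the generic entropy-bonus objective such methods build on; the function name and the coefficient beta are assumptions for illustration.

```python
# Generic entropy-regularized policy-gradient loss: a minimal sketch of the
# standard entropy-bonus idea, not EPO's actual mechanisms.
import torch
import torch.nn.functional as F

def entropy_regularized_pg_loss(logits, actions, advantages, beta=0.01):
    """logits: [T, V]; actions: [T] (long); advantages: [T]."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Policy-gradient term: raise log-prob of actions with positive advantage.
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg = -(chosen * advantages).mean()
    # Entropy bonus: subtracting beta * H discourages premature entropy collapse.
    entropy = -(probs * log_probs).sum(-1).mean()
    return pg - beta * entropy
```

A larger beta resists premature entropy collapse; annealing it toward zero lets the policy commit once successful strategies emerge.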