EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
- URL: http://arxiv.org/abs/2509.22576v1
- Date: Fri, 26 Sep 2025 16:51:44 GMT
- Title: EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
- Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
- Abstract summary: Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms.
- Score: 15.529826552402769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage premature policy convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
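A minimal sketch of the entropy-control loop the abstract describes, combining mechanisms (2) and (3): an entropy bonus with a phase-based weight, plus a smoothing penalty that anchors entropy to its recent historical average. The class name, window size, schedule, and coefficients are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

import torch


class EntropySmoother:
    """Sketch of EPO-style entropy control: an entropy bonus whose weight
    follows a phase-based schedule, plus a smoothing penalty pulling the
    current policy entropy toward its recent historical average."""

    def __init__(self, window: int = 50, beta: float = 0.1):
        self.history = deque(maxlen=window)  # recent per-step entropies
        self.beta = beta                     # smoothing penalty weight

    def alpha(self, step: int, total_steps: int,
              alpha_hi: float = 0.01, alpha_lo: float = 0.001) -> float:
        # Phase-based weighting (assumed linear decay): explore early,
        # exploit late. The paper's exact schedule may differ.
        frac = step / max(total_steps, 1)
        return alpha_hi + (alpha_lo - alpha_hi) * frac

    def regularize(self, policy_loss: torch.Tensor, entropy: torch.Tensor,
                   step: int, total_steps: int) -> torch.Tensor:
        smooth_pen = torch.zeros(())
        if self.history:
            hist_mean = sum(self.history) / len(self.history)
            # Penalize deviation from the historical average entropy,
            # discouraging both collapse and chaotic spikes.
            smooth_pen = (entropy - hist_mean) ** 2
        self.history.append(float(entropy.detach()))
        return (policy_loss
                - self.alpha(step, total_steps) * entropy
                + self.beta * smooth_pen)
```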
Related papers
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target. We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
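A sketch of the two mechanisms named above, under assumptions: the momentum model as a plain exponential moving average of parameters, and the IQR filter as the standard low-outlier rule; the coefficients are placeholders.

```python
import numpy as np


def iqr_keep_mask(entropies: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Prune trajectories whose entropy falls below Q1 - k*IQR (the usual
    low-outlier rule). Whether M-GRPO prunes only the low tail, and its
    value of k, are assumptions."""
    q1, q3 = np.percentile(entropies, [25, 75])
    return entropies >= q1 - k * (q3 - q1)  # boolean keep-mask


def momentum_update(theta_m: np.ndarray, theta: np.ndarray,
                    m: float = 0.99) -> np.ndarray:
    # Slowly evolving momentum model used as a stable training target
    # (exponential moving average; the coefficient is a placeholder).
    return m * theta_m + (1.0 - m) * theta
```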
arXiv Detail & Related papers (2025-12-15T08:07:23Z) - Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric. We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions.
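A hedged sketch of what a bidirectional entropy-ratio constraint could look like as a penalty term; the band [0.9, 1.1] and the quadratic penalty form are assumptions, not the paper's exact mechanism.

```python
import torch


def entropy_ratio_penalty(curr_entropy: torch.Tensor,
                          prev_entropy: torch.Tensor,
                          lo: float = 0.9, hi: float = 1.1) -> torch.Tensor:
    """Penalize updates that push H(pi_new) / H(pi_old) outside [lo, hi];
    zero penalty inside the band, quadratic outside it."""
    ratio = curr_entropy / (prev_entropy.detach() + 1e-8)
    excess = torch.clamp(ratio - hi, min=0.0) + torch.clamp(lo - ratio, min=0.0)
    return excess ** 2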
arXiv Detail & Related papers (2025-12-05T10:26:32Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
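The identity underlying such token-level surrogates is the autoregressive factorization of the sequence log-probability, which turns the sequence-level REINFORCE gradient into an exact sum of token-level terms:

```latex
\nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \right]
  = \mathbb{E}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
  = \mathbb{E}\!\left[ R(y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]
```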
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
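A sketch of asymmetric clipping with an adaptation hook, purely illustrative: the bound values and the adaptation rule are placeholders, since the summary does not specify BAPO's criterion.

```python
import torch


def asymmetric_ppo_loss(ratio: torch.Tensor, adv: torch.Tensor,
                        c_lo: float, c_hi: float) -> torch.Tensor:
    """PPO-style surrogate with independent lower/upper clip bounds."""
    clipped = torch.clamp(ratio, c_lo, c_hi)
    return -torch.min(ratio * adv, clipped * adv).mean()


def adapt_bounds(pos_frac: float, target: float = 0.5,
                 c_lo: float = 0.8, c_hi: float = 1.2,
                 step: float = 0.01) -> tuple[float, float]:
    # Placeholder rule: if positive-advantage tokens contribute less than
    # the target share, loosen the upper bound; otherwise loosen the lower.
    if pos_frac < target:
        c_hi += step
    else:
        c_lo -= step
    return c_lo, c_hi
```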
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
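A sketch of the initial-anchored target-entropy idea as a simple multiplicative controller; the anchor fraction, learning rate, and update rule are assumptions.

```python
import math


class AdaptiveEntropyCoef:
    """Raise the entropy coefficient when policy entropy drops below a
    target anchored to the initial entropy, lower it when above."""

    def __init__(self, init_entropy: float, anchor_frac: float = 0.8,
                 coef: float = 1e-3, lr: float = 0.05):
        self.target = anchor_frac * init_entropy  # anchored to step-0 entropy
        self.coef = coef
        self.lr = lr

    def update(self, entropy: float) -> float:
        # Multiplicative (exponentiated-gradient style) controller step.
        self.coef *= math.exp(self.lr * (self.target - entropy))
        return self.coef
```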
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning [36.00460460149206]
We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with a REINFORCE policy gradient on temperature-adjusted distributions. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization.
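One plausible reading of "distribution as regularization", sketched under assumptions: pull the policy toward a temperature-adjusted, stop-gradient copy of itself, so tau directly sets whether entropy is pushed up or down. This is not necessarily AEPO's exact construction.

```python
import torch
import torch.nn.functional as F


def tempered_target_reg(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Regularize toward a temperature-adjusted copy of the policy.
    tau > 1 flattens the target (raises entropy); tau < 1 sharpens it
    (lowers entropy). The stop-gradient KL form is an assumption."""
    target = F.softmax(logits.detach() / tau, dim=-1)     # tempered target
    logp = F.log_softmax(logits, dim=-1)
    return F.kl_div(logp, target, reduction="batchmean")  # KL(target || pi)
```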
arXiv Detail & Related papers (2025-10-09T12:24:08Z) - Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning [41.90621652673528]
We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift.
arXiv Detail & Related papers (2025-09-26T17:20:38Z) - HAEPO: History-Aggregated Exploratory Policy Optimization [4.782714372521615]
We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss that combats these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities and applies a Plackett-Luce softmax across trajectories. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO.
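A sketch of the trajectory-level construction: per-trajectory log-probability sums, a softmax across trajectories, and a reward-derived target; the exact listwise coupling in HAEPO is an assumption here.

```python
import torch


def haepo_style_loss(token_logps: list[torch.Tensor],
                     rewards: torch.Tensor) -> torch.Tensor:
    """Compress each trajectory to the sum of its token log-probabilities,
    softmax across trajectories, and encourage higher-reward trajectories
    to carry more of that mass (cross-entropy to a reward-derived target)."""
    scores = torch.stack([lp.sum() for lp in token_logps])  # one score/traj
    traj_dist = torch.softmax(scores, dim=0)
    target = torch.softmax(rewards, dim=0)
    return -(target * torch.log(traj_dist + 1e-8)).sum()
```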
arXiv Detail & Related papers (2025-08-26T09:59:44Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
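For context, the classical variance-minimizing (optimal) baseline for a REINFORCE estimator weights rewards by squared gradient norms; whether OPO uses this exact form or a tractable surrogate is not stated in the summary:

```latex
b^{*} \;=\; \frac{\mathbb{E}\big[\, \|\nabla_\theta \log \pi_\theta(y)\|^{2}\, R(y) \,\big]}
                 {\mathbb{E}\big[\, \|\nabla_\theta \log \pi_\theta(y)\|^{2} \,\big]}
```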
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariances, respectively.
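A sketch of the Clip-Cov idea: mask out the tokens whose log-probability/advantage covariance term is largest; the per-token centering and the cutoff quantile are assumptions.

```python
import torch


def clip_cov_mask(logps: torch.Tensor, advs: torch.Tensor,
                  quantile: float = 0.99) -> torch.Tensor:
    """Estimate each token's covariance-style term between log-probability
    and advantage, then drop the highest-covariance tokens from the
    policy-gradient update (multiply the returned mask into the PG loss)."""
    cov = (logps - logps.mean()) * (advs - advs.mean())  # per-token term
    cutoff = torch.quantile(cov, quantile)
    return (cov <= cutoff).float()
```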
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization [0.0]
PPO-BR establishes a paradigm for adaptive RL by fusing exploration and convergence signals into a single trust region. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery.
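A heavily hedged sketch of a dual-signal trust region: widen the clip range while normalized entropy is high, shrink it as rewards stabilize; the fusion rule and gains are placeholders, not PPO-BR's published update.

```python
def adaptive_clip_range(eps0: float, entropy_norm: float,
                        reward_trend: float,
                        a: float = 0.5, b: float = 0.5) -> float:
    # Exploration phase: high normalized entropy widens the PPO clip
    # range. Convergence phase: an improving reward trend shrinks it.
    return eps0 * (1.0 + a * entropy_norm - b * reward_trend)
```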
arXiv Detail & Related papers (2025-05-23T10:30:58Z) - Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs.
arXiv Detail & Related papers (2024-10-22T09:29:53Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
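The basic weight-space merge behind such strategies, as a sketch; WARP's three stages use EMA and interpolation variants beyond this plain convex combination.

```python
import torch


def average_weights(state_dicts: list[dict], weights: list[float]) -> dict:
    """Convex combination of parameter tensors from several fine-tuned
    policies (a plain linear merge; weights should sum to 1)."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged
```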
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
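A sketch of the perturbed distribution, assuming the common construction of adding uniform noise to the Gaussian mean so exploration keeps a floor throughout training; alpha is a typical value, not necessarily the paper's.

```python
import torch


def rpo_action_distribution(mean: torch.Tensor, std: torch.Tensor,
                            alpha: float = 0.5) -> torch.distributions.Normal:
    """Gaussian policy whose mean is perturbed with U(-alpha, alpha) noise,
    preventing the action distribution from collapsing to a point."""
    noise = (torch.rand_like(mean) * 2.0 - 1.0) * alpha  # U(-alpha, alpha)
    return torch.distributions.Normal(mean + noise, std)
```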
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
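For intuition, the (non-optimistic) entropy-regularized multiplicative weights update takes the form below, with step size \eta, regularization strength \tau, and action values Q_t; the optimistic variant additionally extrapolates with the most recent value estimate:

```latex
\pi_{t+1}(a) \;\propto\; \pi_t(a)^{\,1-\eta\tau}\, \exp\!\big(\eta\, Q_t(a)\big)
```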
arXiv Detail & Related papers (2022-10-03T16:05:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.