Agentic Entropy-Balanced Policy Optimization
- URL: http://arxiv.org/abs/2510.14545v1
- Date: Thu, 16 Oct 2025 10:40:52 GMT
- Title: Agentic Entropy-Balanced Policy Optimization
- Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
- Abstract summary: Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. We propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
- Score: 114.90524574220764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
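To make the two components more concrete, below is a minimal PyTorch-style sketch. The helper names, thresholds, and weighting choices (`penalty`, `entropy_quantile`, `adv_scale`, the budget split) are illustrative assumptions and not the exact formulation or hyperparameters from the paper; it only shows the general shape of entropy pre-monitored budget allocation with a branch penalty, and of a clipped surrogate whose clipping factor is applied through a stop-gradient on high-entropy tokens.

```python
import torch

def allocate_rollout_budget(step_entropies: torch.Tensor, total_budget: int, penalty: float = 0.5):
    """Sketch of the dynamic entropy-balanced rollout (assumed formulation).

    Pre-monitored per-step entropies decide how much of the sampling budget
    goes to branching at high-entropy tool-call steps versus global rollouts,
    with consecutive high-entropy steps down-weighted to avoid over-branching.
    """
    weights = step_entropies.clone()
    high = step_entropies > step_entropies.mean()
    # Branch penalty: a high-entropy step that immediately follows another
    # high-entropy step gets its branching weight reduced.
    repeat = torch.zeros_like(high)
    repeat[1:] = high[1:] & high[:-1]
    weights = torch.where(repeat, weights * penalty, weights)
    branch_frac = weights[high].sum() / weights.sum().clamp_min(1e-8)
    branch_budget = int((total_budget * branch_frac).round())
    global_budget = total_budget - branch_budget
    return global_budget, branch_budget, weights / weights.sum().clamp_min(1e-8)


def entropy_balanced_loss(logprobs, old_logprobs, advantages, entropies,
                          clip_eps: float = 0.2, entropy_quantile: float = 0.8,
                          adv_scale: float = 0.1):
    """Sketch of the entropy-balanced policy update (assumed formulation)."""
    ratio = (logprobs - old_logprobs).exp()
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    high = entropies >= torch.quantile(entropies, entropy_quantile)
    # Stop-gradient trick: the forward value equals the clipped ratio, but the
    # gradient flows through the unclipped ratio, so gradients on high-entropy
    # tokens are rescaled rather than zeroed out by clipping.
    sg_clipped = ratio * (clipped / ratio).detach()
    eff_clipped = torch.where(high, sg_clipped, clipped)
    # Entropy-aware advantage estimation: modestly up-weight high-uncertainty tokens.
    adv = advantages * (1.0 + adv_scale * entropies.detach())
    return -torch.min(ratio * adv, eff_clipped * adv).mean()
```

Both pieces key off token- or step-level entropy, which in practice would be computed from the policy's next-token distribution during rollout; the exact thresholds and rescaling used by AEPO are described in the paper itself.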
Related papers
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens [38.425692691443764]
Existing Reinforcement Learning (RL) fine-tuning methods rely heavily on entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. We propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement.
arXiv Detail & Related papers (2026-02-17T14:46:48Z) - AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models. Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z) - Entropy-Gated Selective Policy Optimization:Token-Level Gradient Allocation for Hybrid Training of Large Language Models [18.084251607403406]
Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
arXiv Detail & Related papers (2026-02-03T09:38:21Z) - Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL [56.085103402298905]
We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
arXiv Detail & Related papers (2025-10-25T09:17:47Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning [36.00460460149206]
We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with a REINFORCE policy gradient on temperature-adjusted distributions. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization.
arXiv Detail & Related papers (2025-10-09T12:24:08Z) - EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning [15.529826552402769]
Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms.
arXiv Detail & Related papers (2025-09-26T16:51:44Z) - ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models [62.82372407840088]
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. We propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%.
arXiv Detail & Related papers (2025-09-26T03:38:27Z) - SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents [58.174206358223415]
Self-Evolving Embodied Agents-R1, or SEEA-R1, is the first reinforcement fine-tuning framework designed for self-evolving embodied agents. We show that SEEA-R1 can support autonomous adaptation and reward-driven self-evolution.
arXiv Detail & Related papers (2025-06-26T18:00:07Z) - Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning [80.87085014818052]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns. We observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways.
arXiv Detail & Related papers (2025-06-02T17:54:39Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances.
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization [16.335645061396455]
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors. We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL). Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability.
arXiv Detail & Related papers (2023-11-30T16:53:32Z)