Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning
- URL: http://arxiv.org/abs/2510.08141v4
- Date: Thu, 23 Oct 2025 10:06:36 GMT
- Title: Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning
- Authors: Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang
- Abstract summary: We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization.
- Score: 36.00460460149206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
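To make the mechanism concrete, here is a minimal sketch, assuming one plausible reading of the abstract: the entropy bonus is replaced by a cross-entropy pull toward a detached, temperature-adjusted copy of the policy (whose gradient has the REINFORCE form), with the temperature regulated toward an arbitrary entropy target. The function names, the controller, and all constants are our illustrative assumptions, not the authors' implementation.

```python
import math

import torch
import torch.nn.functional as F


def aepo_style_regularizer(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Cross-entropy toward a detached temperature-adjusted target
    pi_tau(a) proportional to pi(a)^(1/tau); its gradient has the
    REINFORCE form, so no explicit entropy bonus enters the objective."""
    log_pi = F.log_softmax(logits, dim=-1)               # current policy log-probs
    target = F.softmax(logits.detach() / tau, dim=-1)    # fixed target distribution
    # tau > 1 pulls entropy up, tau < 1 pulls it down, tau = 1 gives zero gradient.
    return -(target * log_pi).sum(dim=-1).mean()


def regulate_tau(tau: float, entropy: float, target_entropy: float,
                 lr: float = 0.05) -> float:
    """Multiplicative controller (assumed form): raise tau while measured
    entropy sits below the arbitrary target, lower it on overshoot."""
    return tau * math.exp(lr * (target_entropy - entropy))
```

In a GRPO-style loop this term would simply be added to the usual clipped loss, e.g. `loss = grpo_loss + beta * aepo_style_regularizer(logits, tau)`, with `tau` updated once per step from the measured token entropy.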
Related papers
- Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL [56.085103402298905]
We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
arXiv Detail & Related papers (2025-10-25T09:17:47Z)
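The two-critic decoupling in the entry above admits a short sketch, assuming standard soft-RL target conventions; the function and every signature here are hypothetical, not DSAC-E's actual code.

```python
import torch


def decoupled_td_targets(r: torch.Tensor, done: torch.Tensor,
                         q_r_next: torch.Tensor, q_h_next: torch.Tensor,
                         logp_next: torch.Tensor, gamma: float = 0.99):
    """Reward critic bootstraps on reward alone; entropy critic bootstraps
    on future entropy alone, so neither target moves when the temperature
    is updated."""
    target_r = r + gamma * (1.0 - done) * q_r_next
    target_h = gamma * (1.0 - done) * (q_h_next - logp_next)
    return target_r, target_h

# The temperature alpha would then enter only the actor's action score:
#   actor_value = q_r + alpha * q_h
```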
- Agentic Entropy-Balanced Policy Optimization [114.90524574220764]
Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints. We propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
arXiv Detail & Related papers (2025-10-16T10:40:52Z)
- Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z)
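Of the three components listed in the entry above, the initial-anchored target with dynamic global adjustment is the easiest to picture. The toy controller below is one plausible form under our assumptions; all names and constants are made up for illustration.

```python
def adjust_entropy_coef(coef: float, entropy: float, init_entropy: float,
                        anchor: float = 0.9, lr: float = 1e-3,
                        lo: float = 0.0, hi: float = 0.1) -> float:
    """Anchor the entropy target to a fraction of the policy's *initial*
    entropy, then move the global coefficient to close the gap between
    measured and target entropy."""
    target = anchor * init_entropy
    coef += lr * (target - entropy)
    return min(max(coef, lo), hi)  # keep the coefficient in a sane range
```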
- Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective [11.65148836911294]
Entropy collapse is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. We introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER).
arXiv Detail & Related papers (2025-10-11T10:17:38Z)
- EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning [15.529826552402769]
Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms.
arXiv Detail & Related papers (2025-09-26T16:51:44Z)
- Complexity-Driven Policy Optimization [5.757876638037014]
We propose replacing the entropy bonus with a more robust complexity bonus. This regularizer encourages policies that balance quantity (high entropy) with structure (high disequilibrium). Such behaviors arise because the regularizer suppresses both extremes, e.g., maximal disorder and complete order, creating pressure for agents to discover structured yet adaptable strategies.
arXiv Detail & Related papers (2025-09-24T19:32:03Z)
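The "disequilibrium" wording in the entry above suggests an LMC-style (López-Ruiz–Mancini–Calbet) complexity measure, C = H x D, where D is the distance from the uniform distribution; whether the paper uses exactly this form is our assumption. A minimal sketch:

```python
import torch
import torch.nn.functional as F


def complexity_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Entropy x disequilibrium: zero for a deterministic policy (no
    entropy) and zero for the uniform policy (no structure), so the bonus
    suppresses both extremes and peaks in between."""
    p = F.softmax(logits, dim=-1)
    h = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)  # quantity (entropy)
    u = 1.0 / p.shape[-1]                                  # uniform reference
    d = ((p - u) ** 2).sum(dim=-1)                         # structure (disequilibrium)
    return (h * d).mean()
```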
- BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation [77.10390725623125]
We propose the balanced entropy-engineered RAG (BEE-RAG) framework to improve the adaptability of RAG systems to varying context lengths. BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism.
arXiv Detail & Related papers (2025-08-07T07:37:25Z)
- State Entropy Regularization for Robust Reinforcement Learning [49.08983925413188]
We show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust reinforcement learning methods.
arXiv Detail & Related papers (2025-06-08T11:15:31Z)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariances, respectively.
arXiv Detail & Related papers (2025-05-28T17:38:45Z)
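Two details in the entry above reward unpacking. First, the fitted relation R = -a * e^H + b implies a performance ceiling of R = b - a as entropy H approaches zero, which is why collapse caps attainable performance. Second, Clip-Cov can be sketched as masking the small fraction of tokens whose (log-probability, advantage) covariance term is largest; the sketch below is our reading of that description, assumes flat 1-D per-token tensors, and `frac` is an assumed hyperparameter.

```python
import torch


def clip_cov_keep_mask(logp: torch.Tensor, adv: torch.Tensor,
                       frac: float = 0.002) -> torch.Tensor:
    """Per-token covariance contribution between log-prob and advantage;
    the top `frac` of tokens are excluded from the policy-gradient loss."""
    cov = (logp - logp.mean()) * (adv - adv.mean())
    k = max(1, int(frac * cov.numel()))
    cutoff = cov.topk(k).values.min()  # k-th largest covariance term
    return cov < cutoff                # True = token kept in the update
```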
- Entropy annealing for policy mirror descent in continuous time and space [2.8255028200738455]
We study the impact of entropy regularization on the convergence of policy methods for exit time control problems. This paper explains how entropy regularization improves policy optimization, even with the true gradient.
arXiv Detail & Related papers (2024-05-30T17:02:18Z)
- Action Redundancy in Reinforcement Learning [54.291331971813364]
We show that transition entropy can be decomposed into two terms: model-dependent transition entropy and action redundancy. Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
arXiv Detail & Related papers (2021-02-22T19:47:26Z)