On Entropy Control in LLM-RL Algorithms
- URL: http://arxiv.org/abs/2509.03493v2
- Date: Thu, 25 Sep 2025 09:05:58 GMT
- Title: On Entropy Control in LLM-RL Algorithms
- Authors: Han Shen
- Abstract summary: We study the issues of the entropy bonus in the LLM-RL setting. We propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently.
- Score: 10.71946318944523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For RL algorithms, appropriate entropy control is crucial to their effectiveness. A commonly used method to control the policy entropy is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and games RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy's benefits. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt consistently outperforms the baselines across multiple benchmarks.
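The two mechanisms the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the top-k clamping rule and the multiplicative coefficient update are hypothetical stand-ins for AEnt's clamped entropy and automatic coefficient adjustment.

```python
import numpy as np

def clamped_entropy(logits, k=50):
    """Entropy of the policy re-normalized over its top-k tokens.
    Restricting to a smaller token space (here, hypothetically, top-k)
    keeps the bonus from rewarding mass on the vast low-quality tail."""
    top = np.sort(logits)[-k:]          # keep the k largest logits
    p = np.exp(top - top.max())
    p /= p.sum()                        # re-normalized sub-policy
    return -(p * np.log(p)).sum()

def update_coefficient(coef, h_clamped, h_target, lr=0.01):
    """Automatic coefficient adjustment (illustrative): raise the entropy
    coefficient when the clamped entropy falls below target, lower it
    otherwise, keeping the entropy-induced bias in check."""
    coef = coef * np.exp(lr * (h_target - h_clamped))
    return float(np.clip(coef, 1e-5, 1.0))

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)         # toy LLM vocabulary logits
h = clamped_entropy(logits, k=50)
coef = update_coefficient(0.01, h, h_target=3.0)
print(h, coef)
```

The clamped entropy is bounded by log(k), so the coefficient update operates on a bounded signal regardless of vocabulary size.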
Related papers
- Revisiting Entropy in Reinforcement Learning for Large Reasoning Models [54.96908589622163]
We investigate the entropy dynamics of large language models trained with reinforcement learning with verifiable rewards (RLVR). Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR.
arXiv Detail & Related papers (2025-11-08T12:50:41Z) - Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL [56.085103402298905]
We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
arXiv Detail & Related papers (2025-10-25T09:17:47Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning [36.00460460149206]
We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with the REINFORCE policy gradient on temperature-adjusted distributions. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization.
arXiv Detail & Related papers (2025-10-09T12:24:08Z) - Global Convergence of Policy Gradient for Entropy Regularized Linear-Quadratic Control with multiplicative noise [7.339958589013675]
Reinforcement Learning (RL) has emerged as a powerful framework for sequential decision-making in dynamic environments. This paper investigates RL-based control for entropy-regularized Linear-Quadratic Control (LQC) with multiplicative noise. We introduce a novel model-free RL algorithm: Sample-Based Regularized Policy Gradient (SBRPG).
arXiv Detail & Related papers (2025-10-03T11:03:12Z) - From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature [38.46122853450324]
Existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. We introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that dynamically adapts optimization based on token entropy.
arXiv Detail & Related papers (2025-09-20T09:30:25Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation $R = -a e^{H} + b$ between entropy $H$ and downstream performance $R$. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances.
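The transformation equation $R = -a e^{H} + b$ is linear in $e^{H}$, so its coefficients can be recovered by ordinary least squares. A toy illustration with made-up (entropy, performance) pairs — not data from the paper:

```python
import numpy as np

# Hypothetical (entropy, performance) pairs from RL training checkpoints.
# Since R = -a * exp(H) + b is linear in x = exp(H), a and b follow
# directly from ordinary least squares on x.
H = np.array([1.8, 1.5, 1.2, 0.9, 0.6])
R = np.array([0.22, 0.31, 0.40, 0.47, 0.53])

x = np.exp(H)
A = np.column_stack([-x, np.ones_like(x)])   # design matrix: columns for a and b
(a, b), *_ = np.linalg.lstsq(A, R, rcond=None)
R_pred = -a * np.exp(H) + b
print(a, b, R_pred)
```

With a positive fitted $a$, the relation predicts performance rising as entropy falls, which is consistent with the collapse dynamic the paper studies.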
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization [16.335645061396455]
In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors. We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL). Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability.
arXiv Detail & Related papers (2023-11-30T16:53:32Z) - Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning [9.779769486156631]
Here, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
arXiv Detail & Related papers (2022-08-08T23:36:40Z) - State Entropy Maximization with Random Encoders for Efficient Exploration [162.39202927681484]
Recent exploration methods have proven to be a recipe for improving sample-efficiency in deep reinforcement learning (RL).
This paper presents Random Encoders for Efficient Exploration (RE3), an exploration method that utilizes state entropy as an intrinsic reward.
In particular, we find that the state entropy can be estimated in a stable and compute-efficient manner by utilizing a randomly initialized encoder.
arXiv Detail & Related papers (2021-02-18T15:45:17Z)
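The RE3 idea above — an intrinsic reward from a k-nearest-neighbor state-entropy estimate in the representation space of a fixed, randomly initialized encoder — can be sketched roughly as follows. The linear encoder and toy state batch are stand-ins, not the paper's convolutional setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed, randomly initialized linear encoder; RE3's key point
# is that this encoder is never trained.
W = rng.normal(size=(64, 16)) / np.sqrt(64)

def intrinsic_reward(states, k=3):
    """k-NN state-entropy estimate in the random representation space:
    reward grows with the distance to the k-th nearest encoded neighbor,
    so rarely visited regions of state space earn larger bonuses."""
    z = states @ W                                    # encode without training
    d = np.linalg.norm(z[:, None] - z[None, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]                    # k-th neighbor (0th is self)
    return np.log(kth + 1.0)

states = rng.normal(size=(128, 64))                   # toy batch of states
r = intrinsic_reward(states)
print(r.mean())
```

Because the encoder is frozen, the neighbor distances stay stable across training, which is what makes the estimate compute-efficient compared with learned-representation entropy bonuses.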
This list is automatically generated from the titles and abstracts of the papers on this site.