Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL
- URL: http://arxiv.org/abs/2511.11592v1
- Date: Sat, 25 Oct 2025 09:17:47 GMT
- Title: Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL
- Authors: Guojian Zhan, Likun Wang, Pengcheng Wang, Feihong Zhang, Jingliang Duan, Masayoshi Tomizuka, Shengbo Eben Li
- Abstract summary: We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
- Score: 56.085103402298905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., temperature; and (2) short-sighted local entropy tuning that adjusts temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extend the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. The dedicated entropy Q-function, which explicitly quantifies the expected cumulative entropy, then enables us to enforce a trajectory entropy constraint and consequently control the policy's long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that DSAC-E achieves higher returns and better stability.
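To make the mechanism above concrete, the following is a minimal sketch of how a TECRL-style temperature update could look, assuming the temperature acts as a Lagrange multiplier on a trajectory entropy constraint enforced through the dedicated entropy Q-function. All identifiers (q_reward, q_entropy, target_traj_entropy) and the exact update rule are illustrative assumptions, not the authors' DSAC-E implementation.

```python
import torch

# Temperature (alpha) kept in log space so it stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def policy_objective(q_reward, q_entropy):
    """Actor target: reward Q-value plus temperature-weighted entropy Q-value.
    Keeping two separate critics is what leaves their value targets
    unaffected by temperature updates."""
    return q_reward + log_alpha.exp().detach() * q_entropy

def update_temperature(q_entropy, target_traj_entropy):
    """Dual update on the constraint Q_H(s, a) >= target: alpha rises while
    the expected cumulative entropy falls short of the trajectory target,
    and decays once the constraint is satisfied."""
    alpha = log_alpha.exp()
    alpha_loss = alpha * (q_entropy.detach().mean() - target_traj_entropy)
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```

In this reading, the reward critic and the entropy critic are each trained against their own Bellman targets, and the temperature only reweights them in the actor objective, which is what decouples value estimation from temperature tuning.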
Related papers
- Thermodynamic significance of QUBO encoding on quantum annealers [0.0]
We study a Job Shop Scheduling instance using a two-parameter family of encodings controlled by penalty weights. We find that the same encoding transitions that govern computational hardness also reorganize dissipation. Our results establish QUBO penalties as thermodynamic control knobs and motivate thermodynamics-aware encoding strategies for noisy intermediate-scale quantum annealers.
arXiv Detail & Related papers (2026-01-07T21:18:54Z) - Agentic Entropy-Balanced Policy Optimization [114.90524574220764]
Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints. We propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
arXiv Detail & Related papers (2025-10-16T10:40:52Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective [11.65148836911294]
Entropy collapse is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. We introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER).
arXiv Detail & Related papers (2025-10-11T10:17:38Z) - Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Fine-tuning [36.00460460149206]
We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with a REINFORCE policy gradient on temperature-adjusted distributions. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization.
arXiv Detail & Related papers (2025-10-09T12:24:08Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariances, respectively.
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - Quantum Rényi entropy by optimal thermodynamic integration paths [0.0]
We introduce here a theoretical framework based on an optimal thermodynamic integration scheme, where the Rényi entropy can be efficiently evaluated.
We demonstrate it in the one-dimensional quantum Ising model and perform the evaluation of entanglement entropy in the formic acid dimer.
arXiv Detail & Related papers (2021-12-28T15:59:15Z) - Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning [81.30916012273161]
Maximum entropy (MaxEnt) RL algorithms trade off reward and policy entropy to improve training stability and robustness.
Most MaxEnt RL methods use a constant tradeoff coefficient (temperature) to avoid overfitting to noisy value estimates.
We present a simple state-based temperature scheduling approach and instantiate it as Count-Based Q-Learning (CB); a minimal illustrative sketch appears after this list.
We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results.
arXiv Detail & Related papers (2021-11-28T18:28:55Z) - Action Redundancy in Reinforcement Learning [54.291331971813364]
We show that transition entropy can be described by two terms: model-dependent transition entropy and action redundancy.
Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
arXiv Detail & Related papers (2021-02-22T19:47:26Z)
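As flagged in the Count-Based Temperature Scheduling entry above, here is a minimal sketch of state-based temperature scheduling in a tabular setting. The 1/sqrt(visit count) decay and all names are assumptions made in the spirit of that abstract, not the paper's exact schedule.

```python
from collections import defaultdict
import math

# Per-state visit counts; a tabular stand-in for the count estimators
# a full implementation would need on Atari-scale observations.
state_counts = defaultdict(int)

def scheduled_temperature(state, base_temperature=1.0):
    """Return a per-state temperature that decays with visitation, so the
    entropy bonus shrinks where value estimates are no longer noisy."""
    state_counts[state] += 1
    return base_temperature / math.sqrt(state_counts[state])

# Usage inside a soft Q-learning style backup (sketch):
#   tau = scheduled_temperature(s)
#   soft_value = tau * logsumexp(q_values[s] / tau)
```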