Rethinking Entropy Regularization in Large Reasoning Models
- URL: http://arxiv.org/abs/2509.25133v1
- Date: Mon, 29 Sep 2025 17:49:25 GMT
- Title: Rethinking Entropy Regularization in Large Reasoning Models
- Authors: Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, Jing Shao
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). It suffers from a critical issue: entropy collapse and premature convergence. We propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states.
- Score: 43.961667993429906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRMs, which easily trigger a global entropy explosion as the model indiscriminately explores all possible actions and states. To address this, we propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states. SIREN achieves this through a two-step entropy masking mechanism, consisting of a top-p mask and a peak-entropy mask. In addition, regularization is transformed into a self-anchored form to stabilize training. Across five mathematical benchmarks, SIREN attains superior average performance over previous entropy-related RLVR approaches, exemplified by a +6.6 maj@k improvement on AIME24/25 with Qwen2.5-Math-7B. Further analysis confirms that SIREN promotes greater response diversity and maintains entropy at an appropriate level, which helps to preserve the validation pass@k throughout training. This effectively mitigates the premature convergence problem common in RLVR for LRM.
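The abstract names three ingredients (a top-p mask, a peak-entropy mask, and a self-anchored regularizer) without giving formulas. The following is a minimal illustrative sketch of how such pieces could fit together, not the authors' implementation. The function names, the quantile rule used to pick "peak" positions, the renormalization inside the top-p set, and the squared-distance form of the anchored penalty are all assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_p_subset(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize.

    Assumption: entropy is later measured only over this restricted action set.
    """
    kept, mass = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        mass += p
        if mass >= top_p:
            break
    return [p / mass for p in kept]


def peak_entropy_mask(entropies, quantile):
    """Mark positions whose entropy is at or above the given quantile.

    Assumption: "peak-entropy" positions are chosen by a per-sequence quantile.
    """
    if not entropies:
        raise ValueError("need at least one position")
    ranked = sorted(entropies)
    threshold = ranked[min(int(quantile * len(ranked)), len(ranked) - 1)]
    return [h >= threshold for h in entropies]


def selective_entropy(step_probs, top_p=0.9, quantile=0.8):
    """Mean top-p entropy over the selected high-entropy positions, plus the mask."""
    mask = peak_entropy_mask([entropy(p) for p in step_probs], quantile)
    selected = [
        entropy(top_p_subset(p, top_p))
        for p, keep in zip(step_probs, mask)
        if keep
    ]
    return sum(selected) / len(selected), mask


def anchored_penalty(mean_entropy, anchor):
    """Hypothetical self-anchored term: pull the masked entropy toward an anchor
    value instead of maximizing it without bound."""
    return (mean_entropy - anchor) ** 2
```

In this sketch, `step_probs` is a list of next-token distributions, one per generated position. Confident positions are excluded by the mask, and low-probability tail tokens are excluded by the top-p step, so the regularizer only acts on a subset of states and actions, which is the idea the abstract describes.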
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z)
- Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
arXiv Detail & Related papers (2026-02-05T04:06:55Z)
- APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards [61.52322047892064]
Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs). We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST.
arXiv Detail & Related papers (2026-01-31T14:53:20Z)
- Revisiting Entropy in Reinforcement Learning for Large Reasoning Models [54.96908589622163]
We investigate the entropy dynamics of large language models trained with reinforcement learning with verifiable rewards (RLVR). Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR.
arXiv Detail & Related papers (2025-11-08T12:50:41Z)
- Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z)
- Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective [11.65148836911294]
Entropy collapse is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. We introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER).
arXiv Detail & Related papers (2025-10-11T10:17:38Z)
- Quantile Advantage Estimation for Entropy-Safe Reasoning [44.192277495613695]
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL, which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline.
arXiv Detail & Related papers (2025-09-26T17:37:52Z)
- EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning [15.529826552402769]
Training LLM agents in multi-turn environments with sparse rewards presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms.
arXiv Detail & Related papers (2025-09-26T16:51:44Z)
- The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [58.559544190947584]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z)
- CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention [24.71056659948577]
We introduce CURE (Critical-token-gUided Re-concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy.
arXiv Detail & Related papers (2025-08-14T18:40:34Z)
- The Invisible Leash: Why RLVR May or May Not Escape Its Origin [47.488691410579925]
It remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary. Under current training conditions, RLVR can operate as a support-constrained optimization mechanism. While RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions.
arXiv Detail & Related papers (2025-07-20T07:04:08Z)
- Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting [52.6508222408558]
We introduce Elucidated Rolling Diffusion Models (ERDM). ERDM is the first framework to unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). On 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5° resolution, ERDM consistently outperforms key diffusion-based baselines.
arXiv Detail & Related papers (2025-06-24T21:44:31Z)
- Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems [89.35169042718739]
Collaborative inference enables end users to leverage powerful deep learning models without exposure of sensitive raw data to cloud servers. Recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs). This work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on the Gaussian mixture estimation and propose a conditional entropy algorithm to enhance the inversion robustness.
arXiv Detail & Related papers (2025-03-01T07:15:21Z)
- Adversarial Inverse Reinforcement Learning for Mean Field Games [17.392418397388823]
Mean field games (MFGs) provide a mathematically tractable framework for modelling large-scale multi-agent systems.
This paper proposes a novel framework, Mean-Field Adversarial IRL (MF-AIRL), which is capable of tackling uncertainties in demonstrations.
arXiv Detail & Related papers (2021-04-29T21:03:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.