Quantile Advantage Estimation for Entropy-Safe Reasoning
- URL: http://arxiv.org/abs/2509.22611v1
- Date: Fri, 26 Sep 2025 17:37:52 GMT
- Title: Quantile Advantage Estimation for Entropy-Safe Reasoning
- Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He,
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL, which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline.
- Score: 44.192277495613695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
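A minimal sketch of the group-wise K-quantile baseline described in the abstract; the function name and the binary-reward setup are illustrative assumptions, not the paper's own code:

```python
import numpy as np

def qae_advantages(rewards, k=0.5):
    """Quantile Advantage Estimation, sketched for one query's group.

    Instead of subtracting the group mean (as in GRPO/DAPO), subtract
    the group-wise K-quantile of the rewards. With binary rewards this
    reproduces the two-regime gate from the abstract.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, k)  # group-wise K-quantile baseline
    return rewards - baseline

# Hard query: success rate p = 1/8 <= 1 - K, so the baseline is 0:
# the rare success gets advantage +1 and every failure gets exactly 0.
hard = qae_advantages([0, 0, 0, 0, 0, 0, 1, 0], k=0.5)

# Easy query: p = 7/8 > 1 - K, so the baseline is 1: the remaining
# failure gets advantage -1 and every success gets exactly 0.
easy = qae_advantages([1, 1, 1, 1, 1, 0, 1, 1], k=0.5)
```

With K tuned so that most rewards sit at the baseline, most responses receive zero advantage, consistent with the sparsified credit assignment the abstract reports.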
Related papers
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches treat all incorrect rollouts within a group identically. We propose the Asymmetric Confidence-aware Error Penalty (ACE).
arXiv Detail & Related papers (2026-02-24T22:46:43Z) - STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens [38.425692691443764]
Existing Reinforcement Learning (RL) fine-tuning methods rely heavily on entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. We propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement.
arXiv Detail & Related papers (2026-02-17T14:46:48Z) - Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning [60.00161035836637]
Group Relative Policy Optimization has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. We introduce Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline.
arXiv Detail & Related papers (2026-01-12T10:48:02Z) - Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs [6.948242693954442]
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central approach for improving the reasoning ability of large language models. We propose LESS, a correctness-aware reinforcement framework that performs fine-grained advantage modulation over low-entropy segments.
arXiv Detail & Related papers (2025-11-30T14:19:36Z) - Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL [56.085103402298905]
We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
arXiv Detail & Related papers (2025-10-25T09:17:47Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
arXiv Detail & Related papers (2025-10-01T13:56:44Z) - Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models [29.822717720666134]
We show that the clipping mechanism in PPO and GRPO induces biases on entropy. With a more aggressive clip-low value, one can increase entropy, promote exploration, and ultimately prevent entropy collapse in RLVR training.
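The asymmetric clipping this entry studies can be sketched as a PPO-style surrogate with separate lower and upper clip ranges; the parameter names `eps_low`/`eps_high` are illustrative assumptions:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO/GRPO-style clipped objective with an asymmetric clip range.

    ratio = pi_new(token) / pi_old(token). Clipping to
    [1 - eps_low, 1 + eps_high] need not be symmetric: per the entry's
    finding, a more aggressive clip-low (larger eps_low) permits larger
    probability decreases on negative-advantage tokens, which raises
    entropy, while a larger clip-high has the opposite effect.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard pessimistic PPO surrogate: take the minimum of the
    # unclipped and clipped terms.
    return np.minimum(ratio * advantage, clipped * advantage)
```

For a negative-advantage token at ratio 0.5, the default `eps_low=0.2` returns the clipped constant -0.8 (no gradient flows, so further probability decrease is blocked), whereas `eps_low=0.5` returns the unclipped -0.5 and keeps pushing the probability down.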
arXiv Detail & Related papers (2025-09-30T11:33:15Z) - The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [58.559544190947584]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z) - Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning [80.87085014818052]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns. We observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways.
arXiv Detail & Related papers (2025-06-02T17:54:39Z) - Stable Reinforcement Learning for Efficient Reasoning [2.838966689544288]
GRPO-$\lambda$ is an efficient and stabilized variant of GRPO. It dynamically adjusts the reward strategy by monitoring the correctness ratio. It improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
arXiv Detail & Related papers (2025-05-23T16:43:03Z) - A Piecewise Lyapunov Analysis of Sub-quadratic SGD: Applications to Robust and Quantile Regression [22.917692982875025]
We introduce a novel piecewise Lyapunov function that enables us to handle functions $f$ with only first-order differentiability. We derive finite-time moment bounds under general diminishing stepsizes, as well as constant stepsizes. Our results have wide applications, especially in online statistical methods.
arXiv Detail & Related papers (2025-04-11T00:20:37Z) - Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning [12.721239079824622]
We propose a safe reinforcement learning (RL) paradigm that enables a higher level of safety without any expectation-form approximations. A tilted update strategy for quantile gradients is implemented to compensate for the asymmetric distributional density. Experiments demonstrate that the proposed model fully meets safety requirements (quantile constraints) while outperforming state-of-the-art benchmarks with higher return.
arXiv Detail & Related papers (2024-12-17T18:58:00Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best-fitting ReLU with respect to square loss, given draws from some unknown distribution.
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.