Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
- URL: http://arxiv.org/abs/2601.07320v1
- Date: Mon, 12 Jan 2026 08:41:47 GMT
- Title: Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
- Authors: Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, Bo Zhou,
- Abstract summary: Segmental Advantage Estimation mitigates the bias that Generalized Advantage Estimation can incur in Reinforcement Learning with Verifiable Rewards. SAE achieves superior performance, with marked improvements in final scores, stability, and sample efficiency.
- Score: 17.530233901658253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating $n$-step advantages at every token (as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE first partitions the generated sequence into coherent sub-segments using low-probability tokens as heuristic boundaries. It then selectively computes variance-reduced advantage estimates only from these information-rich segment transitions, effectively filtering out noise from intermediate tokens. Our experiments demonstrate that SAE achieves superior performance, with marked improvements in final scores, training stability, and sample efficiency. These gains are shown to be consistent across multiple model sizes, and a correlation analysis confirms that our proposed advantage estimator achieves a higher correlation with an approximate ground-truth advantage, justifying its superior performance.
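To make the mechanism concrete, below is a minimal NumPy sketch of the two-step procedure the abstract describes: partition the generated sequence at low-probability tokens, then run a GAE-style recursion over segment transitions instead of every token. The probability threshold, the gamma = 1 simplification (common in RLVR), and all function names are assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def segment_boundaries(token_logprobs, threshold=-3.0):
    """Heuristic from the abstract: low-probability tokens mark segment starts.
    The threshold value is an assumption; the abstract does not specify one."""
    bounds = [0]
    for t in range(1, len(token_logprobs)):
        if token_logprobs[t] < threshold:
            bounds.append(t)
    bounds.append(len(token_logprobs))
    return bounds

def segmental_advantages(rewards, values, token_logprobs, lam=0.95, threshold=-3.0):
    """GAE-style recursion applied only at segment transitions. Each segment is
    one macro-step: accumulate its (sparse) rewards and bootstrap from the value
    at the next segment boundary. Discounting is omitted (gamma = 1, as is
    common in RLVR); tokens inherit their segment's advantage."""
    rewards, values = np.asarray(rewards, float), np.asarray(values, float)
    bounds = segment_boundaries(token_logprobs, threshold)
    seg_adv = np.zeros(len(bounds) - 1)
    next_value, gae = 0.0, 0.0  # bootstrap value after the final segment is 0
    for i in reversed(range(len(bounds) - 1)):
        s, e = bounds[i], bounds[i + 1]
        delta = rewards[s:e].sum() + next_value - values[s]  # segment TD error
        gae = delta + lam * gae
        seg_adv[i] = gae
        next_value = values[s]
    # broadcast each segment's advantage to all of its tokens
    token_adv = np.zeros(len(rewards))
    for i in range(len(bounds) - 1):
        token_adv[bounds[i]:bounds[i + 1]] = seg_adv[i]
    return token_adv
```

Note that in this sketch only the value predictions at segment boundaries enter the estimate, which is the filtering of noisy intermediate values that the abstract motivates.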
Related papers
- Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards [39.489554597919145]
Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block.
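A minimal sketch of the contrast this summary draws: plain GRPO shares one normalized scalar advantage across all tokens, while the blockwise variant normalizes each objective's reward separately and applies it only to that objective's block. The reward dictionary, block-label encoding, and function names are illustrative assumptions.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Standard GRPO: one scalar advantage per completion, shared by all tokens."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def blockwise_advantages(group_rewards_per_objective, block_labels):
    """Blockwise variant as described in the summary: normalize each objective's
    reward within the group, then apply that advantage only to the tokens of the
    matching block. block_labels[i][t] names the objective of token t in
    completion i, an assumed encoding rather than the paper's."""
    adv_per_obj = {obj: grpo_advantages(rs)
                   for obj, rs in group_rewards_per_objective.items()}
    return [[adv_per_obj[obj][i] for obj in labels]
            for i, labels in enumerate(block_labels)]

# Example: 2 completions scored on two objectives, "format" and "answer"
rewards = {"format": [1.0, 0.0], "answer": [0.0, 1.0]}
labels = [["format", "format", "answer"], ["format", "answer", "answer"]]
print(blockwise_advantages(rewards, labels))
```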
arXiv Detail & Related papers (2026-02-10T19:22:37Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training performance and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
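The summary does not name the divergence, so the sketch below contrasts standard PPO clipping with one plausible instantiation of a divergence constraint: an explicit per-token KL penalty (via the k3 estimator) in place of the clip. Names, the penalty weight beta, and the choice of KL are assumptions.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO: clip the importance ratio into [1 - eps, 1 + eps]."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv, float)
    return -np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

def divergence_penalized_loss(logp_new, logp_old, adv, beta=0.1):
    """One plausible reading of 'substituting clipping with a divergence
    constraint': an unclipped policy-gradient term plus an explicit per-token
    KL penalty toward the old policy. The divergence choice and beta are
    assumptions; the summary does not specify them."""
    logr = np.asarray(logp_new) - np.asarray(logp_old)
    ratio = np.exp(logr)
    kl = ratio - 1.0 - logr  # k3 estimator of KL(old || new) per sampled token
    return -(ratio * np.asarray(adv, float) - beta * kl).mean()

lp_new, lp_old, adv = [-1.2, -0.4], [-1.0, -0.5], [1.0, -0.5]
print(ppo_clip_loss(lp_new, lp_old, adv), divergence_penalized_loss(lp_new, lp_old, adv))
```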
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- ReNCE: Learning to Reason by Noise Contrastive Estimation [7.590073864595161]
GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. We propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate $K$ outcomes into positive and negative sets.
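The summary states only the positive/negative bifurcation of the $K$ sampled outcomes; the sketch below fills in one NCE-style contrastive objective over those sets, so the exact loss form is an assumption.

```python
import numpy as np

def contrastive_reasoning_loss(seq_logprobs, is_correct):
    """Bifurcate K sampled outcomes into positive (verified-correct) and
    negative sets, then push probability mass toward the positives with a
    softmax-contrastive (InfoNCE-style) objective. The specific loss form is
    an assumption; the summary gives only the positive/negative split."""
    lp = np.asarray(seq_logprobs, dtype=float)
    pos = np.asarray(is_correct, dtype=bool)
    if not pos.any() or pos.all():
        return 0.0  # no contrast available in this group
    logZ = np.logaddexp.reduce(lp)      # normalize over the K outcomes
    return -(lp[pos] - logZ).mean()     # maximize P(positive | group)

# Example: K = 4 rollouts, two verified correct
print(contrastive_reasoning_loss([-12.0, -9.5, -11.0, -10.2],
                                 [True, False, True, False]))
```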
arXiv Detail & Related papers (2026-01-30T00:57:31Z)
- Your Group-Relative Advantage Is Biased [74.57406620907797]
Group-based learning methods rely on group-relative advantage estimation to avoid learned critics. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics.
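The bias is easy to exhibit numerically: because each rollout's own reward enters the group mean, the group-relative estimate for a correct rollout is shrunk by a factor of (K-1)/K relative to the true advantage, with the magnitude depending on problem difficulty p. The binary-reward simulation below demonstrates this; it does not reproduce HA-DW itself, whose reweighting details are not in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 8, 0.2                      # group size, true success rate (difficulty)
r = rng.binomial(1, p, size=(200_000, K)).astype(float)

# group-relative advantage of rollout 0, conditioned on it being correct
mask = r[:, 0] == 1.0
est = r[mask, 0] - r[mask].mean(axis=1)

print(f"true advantage of a correct rollout : {1.0 - p:.3f}")
print(f"mean group-relative estimate        : {est.mean():.3f}")
print(f"predicted shrinkage (K-1)/K * (1-p) : {(K - 1) / K * (1.0 - p):.3f}")
```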
arXiv Detail & Related papers (2026-01-13T13:03:15Z)
- Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning [60.00161035836637]
Group Relative Policy Optimization has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. We introduce Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. OAR-G achieves comparable gains with negligible computational overhead, and both variants significantly outperform a strong GRPO baseline.
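A heavily hedged sketch of the redistribution step the summary describes: a completion-level advantage is spread across tokens in proportion to per-token influence scores. How influence on the final answer is actually measured is the paper's contribution and is treated here as a given input.

```python
import numpy as np

def reshape_advantage(scalar_adv, influence):
    """Redistribute one completion-level advantage across tokens in proportion
    to per-token influence scores (assumed given, not computed here).
    Weights are renormalized so the mean token advantage is preserved."""
    w = np.asarray(influence, float)
    w = w / (w.sum() + 1e-8)
    return scalar_adv * w * len(w)

# a token judged 6x more influential receives 6x the credit
print(reshape_advantage(1.0, [0.1, 0.1, 0.6, 0.2]))
```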
arXiv Detail & Related papers (2026-01-12T10:48:02Z)
- ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction [57.799425838564]
We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost.
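The summary gives only the interface: free predictions of reward and cost used to decide how much test-time compute to spend. Below is one possible stop/continue loop built on such predictions; the callables, the threshold tau, and the stopping rule are all illustrative assumptions.

```python
import random

def adaptive_sampling(sample, predict_reward_and_cost, budget, tau=0.9):
    """Draw candidate generations until the best predicted reward clears a
    confidence threshold or predicted cumulative cost exhausts the budget.
    Both callables stand in for the model and its prediction heads."""
    best_reward, best_y, spent = float("-inf"), None, 0.0
    while spent < budget:
        y = sample()
        reward_hat, cost_hat = predict_reward_and_cost(y)
        spent += cost_hat
        if reward_hat > best_reward:
            best_reward, best_y = reward_hat, y
        if best_reward >= tau:
            break
    return best_y

# demo with stand-in callables: the reward prediction is the sample itself
print(adaptive_sampling(lambda: random.random(), lambda y: (y, 1.0), budget=8.0))
```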
arXiv Detail & Related papers (2025-12-01T09:44:31Z)
- ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning [17.98065634130798]
We propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors.
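A sketch of one such score, comparing the relative generation probabilities of a group of responses to the same prompt. The length normalization and the softmax/centering choices are assumptions; the summary only states that relative probabilities are compared.

```python
import numpy as np

def preference_advantage(seq_logprobs, lengths):
    """Score each response by its generation probability relative to the
    other responses sampled for the same prompt: length-normalize the
    sequence log-probabilities, softmax within the group, then center so
    the advantages sum to zero."""
    lp = np.asarray(seq_logprobs, float) / np.asarray(lengths, float)
    pref = np.exp(lp - np.logaddexp.reduce(lp))   # softmax over the group
    return pref - pref.mean()

print(preference_advantage([-35.0, -52.0, -40.0], [30, 45, 33]))
```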
arXiv Detail & Related papers (2025-11-26T03:10:15Z)
- Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
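A minimal sketch of the augmented objective the summary names: the base RLVR loss plus an MSE term that trains a self-reward, read out at the last generated token, to match the verifiable reward. How the self-reward is extracted (e.g., from a designated logit) and the weight lam are assumptions beyond the summary.

```python
def laser_loss(rlvr_loss, last_token_self_reward, verifier_reward, lam=1.0):
    """Augment the base RLVR loss with an MSE term so the model's last-token
    self-reward regresses onto the verifier's reward; lam balances the two."""
    return rlvr_loss + lam * (last_token_self_reward - verifier_reward) ** 2

print(laser_loss(rlvr_loss=0.42, last_token_self_reward=0.7, verifier_reward=1.0))
```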
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
- ASPO: Asymmetric Importance Sampling Policy Optimization [31.38346888572171]
The Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. We propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens.
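A sketch of the asymmetry the summary describes. The summary does not define "flip"; the reciprocal is used here as one plausible reading, and it is labeled as an assumption in the code.

```python
import numpy as np

def aspo_weights(logp_new, logp_old, adv):
    """Per-token policy-gradient weights with asymmetric importance sampling:
    negative-advantage tokens keep the usual ratio, while positive-advantage
    tokens use a flipped ratio (taken here as the reciprocal, an assumed
    reading of 'flip'), so low-probability tokens are no longer suppressed."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv, float)
    flipped = np.where(adv > 0, 1.0 / ratio, ratio)
    return flipped * adv

print(aspo_weights([-2.0, -0.1], [-1.0, -0.5], [1.0, 1.0]))
```

In the demo, the low-probability positive-advantage token (ratio about 0.37) receives the amplified weight while the high-probability one is dampened, matching the motivation above.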
arXiv Detail & Related papers (2025-10-07T15:54:24Z)
- The Lie of the Average: How Class Incremental Learning Evaluation Deceives You? [48.83567710215299]
Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones. We argue that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. We propose EDGE, an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity.
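A rough sketch of one reading of "sampling extreme class sequences using inter-task similarity": score candidate orderings by summed similarity between adjacent tasks and keep the extremes. The scoring rule and the random search are assumptions beyond the summary.

```python
import numpy as np

def extreme_sequences(similarity, n_candidates=1000, seed=0):
    """Score candidate class orderings by summed adjacent-task similarity and
    return the least- and most-similar orderings found."""
    rng = np.random.default_rng(seed)
    n = similarity.shape[0]
    lo, hi = (np.inf, None), (-np.inf, None)
    for _ in range(n_candidates):
        order = rng.permutation(n)
        score = float(sum(similarity[order[i], order[i + 1]]
                          for i in range(n - 1)))
        if score < lo[0]:
            lo = (score, order)
        if score > hi[0]:
            hi = (score, order)
    return lo, hi

sim = np.random.default_rng(1).random((5, 5))
(lo_s, lo_ord), (hi_s, hi_ord) = extreme_sequences(sim)
print(f"least-similar ordering {lo_ord} (score {lo_s:.2f}), "
      f"most-similar {hi_ord} (score {hi_s:.2f})")
```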
arXiv Detail & Related papers (2025-09-26T17:00:15Z)
- KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning [19.25257653416883]
Key-token Advantage Estimation (KTAE) is a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. We show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-05-22T16:00:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.