Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
- URL: http://arxiv.org/abs/2512.15274v1
- Date: Wed, 17 Dec 2025 10:26:11 GMT
- Title: Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
- Authors: Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong,
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs)<n>Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning.<n>We propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs.
- Score: 18.477056077256233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.
Related papers
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models.<n>But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges.<n>We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z) - Rectifying LLM Thought from Lens of Optimization [48.98086817378953]
Long chain-of-thought (CoT) prompting enables thorough exploration and deliberation.<n>Despite advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors.<n>We introduce RePro, a novel approach to refine LLM reasoning during post-training.
arXiv Detail & Related papers (2025-12-01T17:41:08Z) - Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs)<n>As models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal.<n>We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
arXiv Detail & Related papers (2025-11-06T20:40:27Z) - Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction.<n>CANON based on entropy consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z) - Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (R), a new training-time scaling paradigm for optimizing large language models (LLMs)<n>R enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL)<n>Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of R.
arXiv Detail & Related papers (2025-09-23T17:10:40Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.