Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning
- URL: http://arxiv.org/abs/2601.18984v1
- Date: Mon, 26 Jan 2026 21:38:20 GMT
- Title: Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning
- Authors: Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, Dong Yu
- Abstract summary: Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
- Score: 59.76691952347156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). However, most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. Process reward models (PRMs) offer fine-grained step-level supervision, but their scores are often noisy and difficult to evaluate. As a result, recent PRM benchmarks focus on a more objective capability: detecting the first incorrect step in a reasoning path. However, this evaluation target is misaligned with how PRMs are typically used in RL, where their step-wise scores are treated as raw rewards to maximize. To bridge this gap, we propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake. This design yields stable, interpretable learning signals and improves credit assignment. Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K.
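The abstract describes a prefix/suffix reward split driven by a PRM used purely as a first-error localizer. Below is a minimal sketch of that idea, not the authors' implementation: the function names, reward magnitudes, and the handling of correct rollouts are all illustrative assumptions.

```python
from typing import Callable, List

def vppo_style_rewards(
    steps: List[str],
    is_correct_final: bool,
    locate_first_error: Callable[[List[str]], int],
    prefix_reward: float = 1.0,
    suffix_penalty: float = -1.0,
) -> List[float]:
    """Assign per-step rewards for one rollout in the spirit of the abstract:
    on an incorrect rollout, reward the verified-correct prefix and penalize
    only the steps at and after the first detected error.

    `locate_first_error` stands in for a PRM used only to localize the first
    mistake (it returns the index of the first wrong step, or len(steps) if
    no error is flagged). Reward values here are placeholders, not the
    paper's specification.
    """
    n = len(steps)
    if is_correct_final:
        # Correct rollout: an outcome-style reward spread over all steps (assumed).
        return [prefix_reward] * n

    first_error = locate_first_error(steps)
    first_error = max(0, min(first_error, n))  # clamp to a valid index

    rewards = []
    for i in range(n):
        if i < first_error:
            rewards.append(prefix_reward)   # save the good prefix
        else:
            rewards.append(suffix_penalty)  # targeted penalty after the error
    return rewards


# Toy usage with a stub "PRM" that flags step 2 as the first mistake.
if __name__ == "__main__":
    rollout = ["step 0: set up equation", "step 1: simplify",
               "step 2: bad algebra", "step 3: wrong answer"]
    rewards = vppo_style_rewards(rollout, is_correct_final=False,
                                 locate_first_error=lambda s: 2)
    print(rewards)  # [1.0, 1.0, -1.0, -1.0]
```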
Related papers
- InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning [32.274434679047395]
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). Standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces.
arXiv Detail & Related papers (2026-01-20T18:15:38Z) - Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs [85.37131922131657]
We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs). Even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. We propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs.
arXiv Detail & Related papers (2025-12-06T03:02:55Z) - Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains [13.626335241662977]
Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks. We investigate the effect of RL post-training on intermediate tokens, which are not directly incentivized.
arXiv Detail & Related papers (2025-10-20T23:58:31Z) - Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS [62.22644307952087]
We introduce AIRL-S, the first natural unification of RL-based and search-based TTS. We leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces (see the GRPO advantage sketch after this list). Our results show that our unified approach improves performance by 9% on average over the base model, matching GPT-4o.
arXiv Detail & Related papers (2025-08-19T23:41:15Z) - CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment [44.33395106709674]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. Current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure.
arXiv Detail & Related papers (2025-08-04T11:06:08Z) - Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning [49.21525229904197]
We propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. We introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores.
arXiv Detail & Related papers (2025-05-20T14:12:05Z) - ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [25.329712997545794]
We propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR). ReARTeR enhances RAG systems' reasoning capabilities through post-training and test-time scaling. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements.
arXiv Detail & Related papers (2025-01-14T05:56:26Z) - Entropy-Regularized Process Reward Model [43.09203393852343]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDPs). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
arXiv Detail & Related papers (2024-12-15T01:09:23Z) - Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates (see the Monte Carlo value sketch after this list). Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
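For the AIRL-S entry above, the GRPO component it builds on computes advantages by normalizing each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage, with an illustrative `eps` for numerical stability (not taken from the paper):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages in the GRPO style: each rollout's reward is
    standardized against the mean and standard deviation of the rewards of
    all rollouts sampled for the same prompt. `eps` guards against a zero
    standard deviation when every reward in the group ties.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 4 rollouts for one prompt, two correct (reward 1) and two wrong (reward 0).
advs = grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0]))
print(advs)  # correct rollouts get positive advantage, incorrect get negative
```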
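Similarly, the Monte Carlo value estimation that the VinePPO entry relies on can be sketched as resampling several completions from an intermediate reasoning state and averaging their outcome correctness. The sampler and verifier interfaces below are placeholders, not the authors' code:

```python
from typing import Callable

def mc_value_estimate(
    prefix: str,
    sample_completion: Callable[[str], str],
    is_correct: Callable[[str], bool],
    num_samples: int = 8,
) -> float:
    """Unbiased Monte Carlo estimate of the value of an intermediate
    reasoning state: sample several completions from the prefix and
    return the fraction that reach a correct final answer.
    """
    successes = 0
    for _ in range(num_samples):
        completion = sample_completion(prefix)
        if is_correct(prefix + completion):
            successes += 1
    return successes / num_samples
```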