Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
- URL: http://arxiv.org/abs/2507.01551v2
- Date: Thu, 03 Jul 2025 10:33:08 GMT
- Title: Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
- Authors: Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua
- Abstract summary: We present SPRO, a novel framework that enables process-aware RL through two key innovations. SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
- Score: 48.426139299991604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process Reinforcement Learning~(PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models~(LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose \textbf{S}elf-Guided \textbf{P}rocess \textbf{R}eward \textbf{O}ptimization~(\textbf{SPRO}), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and \textbf{M}asked \textbf{S}tep \textbf{A}dvantage (\textbf{MSA}), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5\% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately $1/3$, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
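To make the MSA idea concrete, the following is a minimal sketch (not the authors' code) of a masked, group-normalized step advantage over shared-prompt samples; the per-step reward inputs and the normalization choices here are assumptions for illustration only:

```python
import numpy as np

def masked_step_advantage(step_rewards, mask):
    """Illustrative Masked Step Advantage over a shared-prompt group.

    step_rewards: (G, T) per-step process rewards for G responses to one
                  prompt, zero-padded to T steps.
    mask:         (G, T) 1.0 while a response is still generating, else 0.0.
    Returns:      (G, T) step-wise advantages, zeroed where masked out.
    """
    step_rewards = np.asarray(step_rewards, float)
    mask = np.asarray(mask, float)
    # One reading of "well-defined cumulative process rewards":
    # accumulate each response's process rewards up to every step.
    cum = np.cumsum(step_rewards * mask, axis=1)
    # Group statistics per step position, counting only live responses.
    n = mask.sum(axis=0).clip(min=1.0)
    mean = (cum * mask).sum(axis=0) / n
    std = np.sqrt(((cum - mean) ** 2 * mask).sum(axis=0) / n) + 1e-6
    # Normalize each step against its group; finished responses are masked.
    return (cum - mean) / std * mask
```

The mask is what keeps the group statistics honest once shorter responses in the group have terminated; that is the interpretation assumed above.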
Related papers
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation [7.893963076886232]
Reinforcement learning has significantly advanced code generation for large language models. Current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. We introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL.
arXiv Detail & Related papers (2025-08-07T09:04:10Z)
- Truncated Proximal Policy Optimization [43.965892659920364]
Truncated Proximal Policy Optimization (T-PPO) improves training efficiency by streamlining policy updates and length-restricted response generation. We propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model.
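A hedged sketch of what advantage estimation from incomplete responses could look like: standard GAE that bootstraps a value estimate at the truncation point instead of treating truncation as termination. EGAE's exact formulation may differ; the interface below is illustrative:

```python
import numpy as np

def egae(rewards, values, done, gamma=1.0, lam=0.95):
    """Illustrative extended GAE for a possibly truncated response.

    rewards: (T,) per-token rewards (often zero until the final token).
    values:  (T+1,) value estimates; values[T] bootstraps a truncated
             response instead of treating truncation as termination.
    done:    True if the response actually finished (no bootstrap).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # At a true terminal step the bootstrap value is zero.
        next_v = 0.0 if (done and t == T - 1) else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```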
arXiv Detail & Related papers (2025-06-18T01:21:38Z)
- TreeRPO: Tree Relative Policy Optimization [55.97385410074841]
TreeRPO is a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling.
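As a rough illustration of step-level groups under tree sampling (assumed interfaces, not TreeRPO's implementation): each node's reward is the Monte-Carlo mean of its subtree's outcomes, and siblings at a node form the group for relative advantages:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # leaves only: 1.0 correct, 0.0 not

def value(node: Node) -> float:
    """A step's reward as the mean outcome of its subtree's leaves,
    i.e. a Monte-Carlo expectation taken over tree samples."""
    if node.outcome is not None:
        return node.outcome
    return sum(value(c) for c in node.children) / len(node.children)

def step_advantages(parent: Node) -> List[float]:
    """Group-relative advantage among sibling steps, GRPO-style
    (mean-centering only; any variance normalization is omitted here)."""
    vals = [value(c) for c in parent.children]
    mean = sum(vals) / len(vals)
    return [v - mean for v in vals]
```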
arXiv Detail & Related papers (2025-06-05T15:56:38Z)
- Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. We investigate whether current reinforcement learning algorithms merely sharpen the base model's distribution around problems it can already solve. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings.
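One plausible reading of an unlikeliness reward, sketched with an assumed rank-based bonus (the paper's exact shaping may differ): correct samples that the policy currently considers unlikely receive a larger reward, pushing probability mass beyond the already-sharp modes:

```python
def unlikeliness_rewards(correct, logprobs, beta=0.5):
    """Illustrative unlikeliness reward over one prompt's samples.

    correct:  list[bool], whether each sampled solution is correct.
    logprobs: list[float], policy log-probability of each sample.
    beta:     strength of the unlikeliness bonus (illustrative value).
    """
    # Rank 0 = most likely under the current policy.
    order = sorted(range(len(logprobs)), key=lambda i: -logprobs[i])
    rank = {i: r for r, i in enumerate(order)}
    n = len(logprobs)
    return [
        (1.0 + beta * rank[i] / max(n - 1, 1)) if correct[i] else 0.0
        for i in range(n)
    ]
```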
arXiv Detail & Related papers (2025-06-03T01:15:15Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
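The filtering step is simple to state; a minimal sketch under an assumed data layout with binary rewards:

```python
def reinforce_rej_filter(groups):
    """Illustrative Reinforce-Rej filter: drop prompt groups whose sampled
    responses are all incorrect or all correct, since they carry no
    within-group learning signal for a REINFORCE-style update.

    groups: list of (prompt, responses, rewards) tuples, rewards in {0, 1}.
    """
    kept = []
    for prompt, responses, rewards in groups:
        mean_reward = sum(rewards) / len(rewards)
        if 0.0 < mean_reward < 1.0:  # keep mixed-outcome groups only
            kept.append((prompt, responses, rewards))
    return kept
```

The surviving groups would then feed a plain policy-gradient update; the tuple layout above is an assumption for illustration.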
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
- Entropy-Regularized Process Reward Model [30.279394036823092]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDPs). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
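A sketch of the entropy-regularized aggregation this suggests, assuming binary outcome rewards from rollouts continuing past a step; under a KL-regularized MDP a soft (log-sum-exp) mean replaces the plain Monte-Carlo average used by standard PRM labeling. The function name and interface are illustrative:

```python
import math

def er_prm_step_label(outcomes, eta=1.0):
    """Illustrative entropy-regularized step label.

    outcomes: list[float], e.g. 1.0/0.0 outcome rewards of rollouts
              sampled from this step onward.
    eta:      regularization strength (> 0); as eta -> 0 this recovers
              the plain Monte-Carlo average of the outcomes.
    """
    n = len(outcomes)
    return math.log(sum(math.exp(eta * r) for r in outcomes) / n) / eta
```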
arXiv Detail & Related papers (2024-12-15T01:09:23Z)
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
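A minimal sketch of a progress-style process reward under an assumed prover interface (the callable below is hypothetical): a step is scored by how much it changes the prover's estimated probability of eventually reaching a correct answer:

```python
def progress_reward(prover_success_prob, prefix, step):
    """Illustrative 'progress' process reward.

    prover_success_prob: hypothetical callable mapping a partial solution
        (a list of steps) to the prover policy's estimated probability of
        eventual success, e.g. via Monte-Carlo rollouts.
    prefix: list of steps taken so far.
    step:   the candidate next step to score.
    """
    return prover_success_prob(prefix + [step]) - prover_success_prob(prefix)
```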
arXiv Detail & Related papers (2024-10-10T17:31:23Z)
- DPO Meets PPO: Reinforced Token Optimization for RLHF [35.638723885233475]
We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms.
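A hedged sketch of a token-wise reward in this spirit, assuming per-token log-probabilities of the same response under a DPO-trained model and a reference model; the paper's exact construction and coefficients may differ:

```python
def token_rewards(dpo_logprobs, ref_logprobs, beta=0.1):
    """Illustrative dense token-wise reward: the per-token log-ratio of a
    DPO-trained model against the reference model, scaled by beta, serves
    as the reward signal for a PPO-style update.

    dpo_logprobs, ref_logprobs: list[float] of equal length.
    """
    return [beta * (d - r) for d, r in zip(dpo_logprobs, ref_logprobs)]
```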
arXiv Detail & Related papers (2024-04-29T17:58:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.