GRPO is Secretly a Process Reward Model
- URL: http://arxiv.org/abs/2509.21154v2
- Date: Wed, 08 Oct 2025 10:13:42 GMT
- Title: GRPO is Secretly a Process Reward Model
- Authors: Michael Sullivan
- Abstract summary: We show that the GRPO RL algorithm induces a non-trivial process reward model under real-world conditions. We propose a simple modification to the algorithm to mitigate this defect. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO.
- Score: 5.637496960655903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
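As background for the abstract above, here is a minimal sketch of the group-relative advantage computation that vanilla GRPO applies to each group of sampled completions; the uniform broadcast of this scalar to every token is what the paper reinterprets as an implicit PRM. Function names are illustrative, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in vanilla GRPO: each completion's
    reward is normalized by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # All rewards equal (all correct or all incorrect):
        # the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# One group of 4 sampled completions with binary outcome rewards:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Note that the advantages within a group always sum to zero, so the signal is purely relative.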
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM). Our method establishes a predefined output set to functionally replace token-sequence sampling in large language models (LLMs). A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z) - GRPO-$λ$: Credit Assignment improves LLM Reasoning [35.452488047246646]
We present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL fine-tuning of LLMs for complex reasoning tasks. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on 4 different math reasoning datasets. With GRPO-$\lambda$, average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over 3 points, with a 4.5-point improvement on the 7B model.
arXiv Detail & Related papers (2025-09-30T19:11:10Z) - MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems [18.92779479033295]
Group Relative Policy Optimization has been shown to be an effective algorithm when an accurate reward model is available. We propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We show that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences.
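A minimal sketch of variance-based reward reweighting in the spirit of the MO-GRPO summary above, assuming each objective's rewards are divided by that objective's standard deviation across the group; function names and details are illustrative, not taken from the paper.

```python
import statistics

def mo_grpo_rewards(reward_vectors):
    """Normalize each objective by its within-group standard deviation,
    then sum, so no single high-variance objective dominates the
    combined reward. reward_vectors: one list of per-objective rewards
    per completion in the group."""
    n_obj = len(reward_vectors[0])
    sigmas = []
    for j in range(n_obj):
        s = statistics.pstdev(v[j] for v in reward_vectors)
        # Guard against zero variance (constant objective in this group).
        sigmas.append(s if s > 0 else 1.0)
    return [sum(v[j] / sigmas[j] for j in range(n_obj)) for v in reward_vectors]

# Two completions, two objectives on very different scales:
print(mo_grpo_rewards([[0.0, 0.0], [100.0, 1.0]]))  # -> [0.0, 4.0]
```

After normalization both objectives contribute equally (2.0 each to the second completion) despite the 100x scale gap in their raw values.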
arXiv Detail & Related papers (2025-09-26T08:32:22Z) - G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance [1.0591274452539035]
We investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories. We find that naively adding guidance delivers limited gains. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO.
arXiv Detail & Related papers (2025-08-18T15:41:16Z) - Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models [3.0763741715155666]
We propose MGRPO (Multi-layer GRPO) to foster reasoning and self-correction abilities. MGRPO significantly outperforms standard GRPO, achieving superior performance by cultivating both capabilities.
arXiv Detail & Related papers (2025-06-05T08:27:34Z) - Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO [22.00487909203855]
Group Relative Policy Optimization fails to update a policy when all responses within a group are incorrect. This limitation underscores a key gap between artificial and human intelligence. We introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups.
arXiv Detail & Related papers (2025-05-16T18:02:05Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a Reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
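A minimal sketch of the sample filtering described for Reinforce-Rej above, assuming binary outcome rewards per completion; the function name is illustrative, not from the paper.

```python
def reinforce_rej_filter(groups):
    """Keep only mixed groups: discard prompts whose sampled groups are
    entirely incorrect (no positive signal) or entirely correct (no
    contrast), since neither carries a useful relative learning signal."""
    kept = []
    for rewards in groups:
        if 0 < sum(rewards) < len(rewards):
            kept.append(rewards)
    return kept

# All-correct and all-incorrect groups are dropped; the mixed group survives:
print(reinforce_rej_filter([[1, 1, 1], [0, 0, 0], [1, 0, 1]]))  # -> [[1, 0, 1]]
```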
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across the MATH and GSM8K datasets in less wall-clock time.
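A minimal sketch of Monte Carlo value estimation in the spirit of the VinePPO summary above: restart generation from an intermediate prefix several times and average the terminal rewards. `rollout_fn` is a hypothetical callable standing in for sampling a completion and scoring it; it is not part of the paper's API.

```python
def mc_value_estimate(rollout_fn, prefix, k=8):
    """Unbiased Monte Carlo estimate of the value of an intermediate
    state: sample k independent completions from the prefix and average
    their terminal rewards. rollout_fn(prefix) completes the prefix
    once and returns its reward."""
    return sum(rollout_fn(prefix) for _ in range(k)) / k
```

In a language environment this is cheap to do because generation can be restarted from any token prefix, which is the flexibility the summary refers to.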
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.