GRPO is Secretly a Process Reward Model
- URL: http://arxiv.org/abs/2509.21154v2
- Date: Wed, 08 Oct 2025 10:13:42 GMT
- Title: GRPO is Secretly a Process Reward Model
- Authors: Michael Sullivan
- Abstract summary: We show that the GRPO RL algorithm induces a non-trivial process reward model under real-world conditions. We propose a simple modification to the algorithm to mitigate this defect. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO.
- Score: 5.637496960655903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
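As background for the abstract above, here is a minimal sketch of the group-relative advantage computation that vanilla GRPO applies to each group of sampled completions; the uniform broadcast of this scalar to every token is what the paper reinterprets as an implicit PRM. Function names are illustrative, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in vanilla GRPO: each completion's
    reward is normalized by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # All rewards equal (all correct or all incorrect):
        # the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# One group of 4 sampled completions with binary outcome rewards:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Note that the advantages within a group always sum to zero, so the signal is purely relative.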
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM). Our method establishes a predefined output set to functionally replace token-sequence sampling in large language models (LLMs). A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z) - GRPO-$λ$: Credit Assignment improves LLM Reasoning [35.452488047246646]
We present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL fine-tuning of LLMs for complex reasoning tasks. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on 4 different math reasoning datasets. With GRPO-$\lambda$, average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over 3 points, with a 4.5-point improvement on the 7B model.
arXiv Detail & Related papers (2025-09-30T19:11:10Z) - MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems [18.92779479033295]
Group Relative Policy Optimization has been shown to be an effective algorithm when an accurate reward model is available. We propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We show that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences.
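A minimal sketch of variance-based reward reweighting in the spirit of the MO-GRPO summary above, assuming each objective's rewards are divided by that objective's standard deviation across the group; function names and details are illustrative, not taken from the paper.

```python
import statistics

def mo_grpo_rewards(reward_vectors):
    """Normalize each objective by its within-group standard deviation,
    then sum, so no single high-variance objective dominates the
    combined reward. reward_vectors: one list of per-objective rewards
    per completion in the group."""
    n_obj = len(reward_vectors[0])
    sigmas = []
    for j in range(n_obj):
        s = statistics.pstdev(v[j] for v in reward_vectors)
        # Guard against zero variance (constant objective in this group).
        sigmas.append(s if s > 0 else 1.0)
    return [sum(v[j] / sigmas[j] for j in range(n_obj)) for v in reward_vectors]

# Two completions, two objectives on very different scales:
print(mo_grpo_rewards([[0.0, 0.0], [100.0, 1.0]]))  # -> [0.0, 4.0]
```

After normalization both objectives contribute equally (2.0 each to the second completion) despite the 100x scale gap in their raw values.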
arXiv Detail & Related papers (2025-09-26T08:32:22Z) - G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance [1.0591274452539035]
We investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories. We find that naively adding guidance delivers limited gains. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO.
arXiv Detail & Related papers (2025-08-18T15:41:16Z) - Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models [3.0763741715155666]
We propose MGRPO (Multi-layer GRPO) to foster reasoning and self-correction abilities. MGRPO significantly outperforms standard GRPO, achieving superior performance by cultivating both capabilities.
arXiv Detail & Related papers (2025-06-05T08:27:34Z) - Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO [22.00487909203855]
Group Relative Policy Optimization fails to update a policy when all responses within a group are incorrect. This limitation underscores a key gap between artificial and human intelligence. We introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups.
arXiv Detail & Related papers (2025-05-16T18:02:05Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a Reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
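A minimal sketch of the sample filtering described for Reinforce-Rej above, assuming binary outcome rewards per completion; the function name is illustrative, not from the paper.

```python
def reinforce_rej_filter(groups):
    """Keep only mixed groups: discard prompts whose sampled groups are
    entirely incorrect (no positive signal) or entirely correct (no
    contrast), since neither carries a useful relative learning signal."""
    kept = []
    for rewards in groups:
        if 0 < sum(rewards) < len(rewards):
            kept.append(rewards)
    return kept

# All-correct and all-incorrect groups are dropped; the mixed group survives:
print(reinforce_rej_filter([[1, 1, 1], [0, 0, 0], [1, 0, 1]]))  # -> [[1, 0, 1]]
```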
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across the MATH and GSM8K datasets in less wall-clock time.
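A minimal sketch of Monte Carlo value estimation in the spirit of the VinePPO summary above: restart generation from an intermediate prefix several times and average the terminal rewards. `rollout_fn` is a hypothetical callable standing in for sampling a completion and scoring it; it is not part of the paper's API.

```python
def mc_value_estimate(rollout_fn, prefix, k=8):
    """Unbiased Monte Carlo estimate of the value of an intermediate
    state: sample k independent completions from the prefix and average
    their terminal rewards. rollout_fn(prefix) completes the prefix
    once and returns its reward."""
    return sum(rollout_fn(prefix) for _ in range(k)) / k
```

In a language environment this is cheap to do because generation can be restarted from any token prefix, which is the flexibility the summary refers to.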
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.