GRPO-$\lambda$: Credit Assignment improves LLM Reasoning
- URL: http://arxiv.org/abs/2510.00194v1
- Date: Tue, 30 Sep 2025 19:11:10 GMT
- Title: GRPO-$\lambda$: Credit Assignment improves LLM Reasoning
- Authors: Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, Sarath Chandar
- Abstract summary: We present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. With GRPO-$\lambda$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points, with a $4.5$-point improvement on the 7B model.
- Score: 35.452488047246646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, such as the state-of-the-art GRPO, have been shown to substantially improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from the $\lambda$-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce several variations for weighting the $\lambda$-return and for applying it to the eligibility trace, all of which provide significant gains over GRPO. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$\lambda$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points, with a $4.5$-point improvement on the 7B model.
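For intuition, the standard $\lambda$-return blends $n$-step returns as $G_t^{\lambda} = (1-\lambda)\sum_{n\geq 1}\lambda^{n-1}G_t^{(n)}$, and eligibility traces implement that weighting incrementally. The paper's precise reformulation is its own contribution; the sketch below only illustrates the ingredients named in the abstract (group-relative advantages as in GRPO, a $\lambda$-decayed trace over tokens, and token-level log-probabilities standing in for a critic-free TD-like signal). The trace construction and normalization here are illustrative assumptions, not the authors' method.

```python
import numpy as np

def grpo_lambda_token_weights(token_logprobs, group_rewards, lam=0.95):
    """Illustrative sketch: combine GRPO's group-relative advantage with a
    lambda-decayed, eligibility-trace-style weighting over tokens.
    NOTE: the trace below is a hypothetical stand-in, not the paper's formula.

    token_logprobs: per-token log-probabilities for each completion in the group.
    group_rewards:  one scalar verifiable reward per completion.
    lam:            decay factor playing the role of lambda.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    # Standard GRPO: the group-normalised reward is the advantage, shared by
    # every token of the corresponding completion.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    weighted = []
    for a, logps in zip(adv, token_logprobs):
        logps = np.asarray(logps, dtype=np.float64)
        trace = np.zeros(len(logps))
        e = 0.0
        for t in range(len(logps)):
            # Accumulating trace: earlier tokens' contributions decay by lam;
            # exp(logp_t) in (0, 1] serves as the per-token signal (an assumption).
            e = lam * e + np.exp(logps[t])
            trace[t] = e
        trace /= trace.max() + 1e-8   # keep weights in a bounded range
        weighted.append(a * trace)    # token-level, rather than uniform, credit
    return weighted

# Toy usage: a group of three completions with verifiable rewards {1, 0, 1}.
demo = grpo_lambda_token_weights(
    token_logprobs=[[-0.1, -0.5, -2.0], [-0.3, -0.2], [-1.0, -0.4, -0.6, -0.2]],
    group_rewards=[1.0, 0.0, 1.0],
)
```

In vanilla GRPO every token of a completion receives the same advantage `a`; the sketch replaces that uniform weight with `a * trace[t]`, which is the sense in which the method refines credit assignment.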
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning [79.365697698062]
We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a framework for multi-domain reasoning. RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance.
arXiv Detail & Related papers (2025-11-15T20:14:51Z)
- Can GRPO Help LLMs Transcend Their Pretraining Origin? [42.200901132315636]
Group Relative Policy Optimization is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its wide adoption, GRPO's gains are often inconsistent. This inconsistency raises a critical question: under what conditions does GRPO improve reasoning and generalize out-of-distribution (OOD)? We first prove theoretically that GRPO is a conservative reweighting scheme, bounded by the base model's distribution and thus unable to discover completely novel solutions.
arXiv Detail & Related papers (2025-10-14T00:37:52Z)
- $\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences [22.199479724764725]
We introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO. These gains come without any modifications to the training data or additional computational cost.
arXiv Detail & Related papers (2025-10-08T10:39:07Z)
- GRPO is Secretly a Process Reward Model [5.637496960655903]
We show that the GRPO RL algorithm induces a non-trivial process reward model under real-world conditions. We propose a simple modification to the algorithm to mitigate this defect. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO.
arXiv Detail & Related papers (2025-09-25T13:40:36Z)
- FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z)
- G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance [1.0591274452539035]
We investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories. We find that naively adding guidance delivers limited gains. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO.
arXiv Detail & Related papers (2025-08-18T15:41:16Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to $2\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [68.26281707780761]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. We show that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO.
arXiv Detail & Related papers (2025-03-28T11:30:05Z)
- VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based value estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time. (A minimal sketch of this Monte Carlo credit-assignment idea appears after this list.)
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
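The VinePPO entry above describes replacing a learned critic with Monte Carlo value estimates obtained by re-sampling completions from intermediate reasoning states. The sketch below illustrates only that general idea, not the paper's exact algorithm; `sample_completions` and `reward_fn` are hypothetical placeholders for a policy sampler and a verifiable-reward checker, and the segmentation into steps is assumed to be given.

```python
from statistics import mean
from typing import Callable, List

def mc_value(prefix: str,
             sample_completions: Callable[[str, int], List[str]],
             reward_fn: Callable[[str], float],
             k: int = 8) -> float:
    """Estimate the value of a partial reasoning trace by sampling k
    completions from the current policy and averaging their final rewards."""
    rollouts = sample_completions(prefix, k)
    return mean(reward_fn(prefix + r) for r in rollouts)

def stepwise_advantages(prompt: str,
                        steps: List[str],
                        sample_completions: Callable[[str, int], List[str]],
                        reward_fn: Callable[[str], float],
                        k: int = 8) -> List[float]:
    """Credit each reasoning step with the change in estimated value it causes:
    A_t = V(prefix + step_t) - V(prefix)."""
    advantages: List[float] = []
    prefix = prompt
    v_before = mc_value(prefix, sample_completions, reward_fn, k)
    for step in steps:
        prefix += step
        v_after = mc_value(prefix, sample_completions, reward_fn, k)
        advantages.append(v_after - v_before)
        v_before = v_after
    return advantages

# Toy usage with stand-in sampler and reward (both hypothetical):
if __name__ == "__main__":
    fake_sampler = lambda prefix, k: [" ... final answer: 42"] * k
    fake_reward = lambda text: 1.0 if "42" in text else 0.0
    print(stepwise_advantages("Q: 6*7=?", [" Step 1.", " Step 2."],
                              fake_sampler, fake_reward, k=4))
```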
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.