$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
- URL: http://arxiv.org/abs/2510.06870v2
- Date: Thu, 09 Oct 2025 03:27:04 GMT
- Title: $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
- Authors: Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu
- Abstract summary: We introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO. These gains come without any modifications to the training data or additional computational cost.
- Score: 22.199479724764725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9\%$, $+1.0\%$, and $+1.7\%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
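The abstract does not give the paper's exact formulation, but the unification it describes can be sketched as a length-dependent aggregation weight over token losses. The snippet below is a hypothetical illustration, not the authors' definition: the function name `aggregate` and the $\mathrm{len}^{-\lambda}$ parameterization are assumptions. Under this sketch, $\lambda = 1$ recovers GRPO's per-response mean (every response counts equally regardless of length), $\lambda = 0$ recovers DAPO-style flat token averaging (longer responses contribute more tokens), and an intermediate or learned $\lambda$ interpolates between the two.

```python
import torch

def aggregate(token_losses, lam):
    """Hypothetical unified aggregation of per-token losses for one group.

    token_losses: list of 1-D tensors, one per sampled response
                  (each entry holds that response's per-token loss terms).
    lam = 1.0 -> average within each response first, then across responses
                 (GRPO-style: every response counts equally).
    lam = 0.0 -> pool all tokens with equal weight
                 (DAPO-style: longer responses contribute more tokens).
    """
    lengths = torch.tensor([t.numel() for t in token_losses],
                           dtype=torch.float32)
    # Each token in response i gets weight len_i ** (-lam).
    weights = lengths ** (-lam)
    weighted_sum = (weights * torch.stack([t.sum()
                                           for t in token_losses])).sum()
    # Normalize so the weights define a proper weighted average over tokens.
    return weighted_sum / (weights * lengths).sum()
```

In $\lambda$-GRPO itself, $\lambda$ would presumably be a `torch.nn.Parameter` optimized jointly with the policy. As a concrete check of the two limits: with two responses of lengths 1 and 3 whose per-token losses are all 2.0 and 4.0 respectively, this sketch yields 3.0 at $\lambda = 1$ (mean of per-response means) and 3.5 at $\lambda = 0$ (global token mean).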
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents [40.88916135445381]
Multi-turn tool calling is challenging for Large Language Models because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low. We propose RC-GRPO, which treats exploration as a controllable steering problem via discrete reward tokens.
arXiv Detail & Related papers (2026-02-03T02:47:32Z)
- GRPO-$λ$: Credit Assignment improves LLM Reasoning [35.452488047246646]
We present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. With GRPO-$\lambda$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by more than $3$ points, with a $4.5$-point improvement on the 7B model.
arXiv Detail & Related papers (2025-09-30T19:11:10Z)
- FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z)
- G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance [1.0591274452539035]
We investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories. We find that naively adding guidance delivers limited gains. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO.
arXiv Detail & Related papers (2025-08-18T15:41:16Z)
- Geometric-Mean Policy Optimization [117.05113769757172]
Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models. GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards. We propose Geometric-Mean Policy Optimization (GMPO) to improve the stability of GRPO by suppressing token reward outliers.
arXiv Detail & Related papers (2025-07-28T09:54:05Z)
- Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning [11.157278744897427]
Group Relative Policy Optimization (GRPO) computes the advantage of each output by subtracting the mean reward of all outputs in the group as a baseline. We show that by using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO.
arXiv Detail & Related papers (2025-05-12T13:09:49Z)
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
- RePO: Understanding Preference Learning Through ReLU-Based Optimization [66.098833436503]
We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances. RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models.
arXiv Detail & Related papers (2025-03-10T15:11:07Z)
- AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z)