Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting
- URL: http://arxiv.org/abs/2508.05928v1
- Date: Fri, 08 Aug 2025 01:24:06 GMT
- Title: Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting
- Authors: Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu
- Abstract summary: Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models. It suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. We propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training.
- Score: 0.7365798659670144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. Across various models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
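The abstract says S-GRPO derives optimal, noise-aware advantage weights but does not state their closed form. The Python sketch below is therefore only an illustration of the idea, assuming binary rewards flipped with a known rate eps and a hypothetical posterior-shrinkage weighting; the function names and the weighting formula are guesses, not the paper's derivation:

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: z-score each reward within its rollout group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:                       # homogeneous group: no signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

def noise_aware_advantages(rewards, eps=0.2):
    """Hypothetical noise-aware reweighting (assumed form). With binary
    rewards flipped with probability eps, the minority outcome in an
    unbalanced group is the likeliest mislabel, so its advantage is
    shrunk the most."""
    r = np.asarray(rewards, dtype=float)
    adv = grpo_advantages(r)
    p = r.mean()              # observed success rate, used as a crude prior
    # Posterior that an observed positive / negative label is genuine:
    post_pos = (1 - eps) * p / ((1 - eps) * p + eps * (1 - p) + 1e-8)
    post_neg = (1 - eps) * (1 - p) / ((1 - eps) * (1 - p) + eps * p + 1e-8)
    return np.where(r > 0.5, post_pos, post_neg) * adv

# Unbalanced group: the lone failure is exactly the label most likely to
# be noise, so its large negative advantage is damped, not trusted fully.
print(noise_aware_advantages([1, 1, 1, 1, 1, 1, 1, 0], eps=0.2))
```

This matches the abstract's observation that unbalanced groups are where noise hurts most: the rarer an outcome is within a group, the more a fixed flip rate should erode trust in it.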
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- Your Group-Relative Advantage Is Biased [74.57406620907797]
Group-based learning methods rely on group-relative advantage estimation to avoid learned critics. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics.
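The summary names an evolving difficulty anchor but not its form; a minimal sketch, assuming the anchor is an exponential moving average of each prompt's success rate used as a history-aware baseline (the class name, EMA form, and beta are illustrative guesses, not the paper's scheme):

```python
import numpy as np

class DifficultyAnchoredAdvantage:
    """Guessed instantiation of a history-aware difficulty anchor:
    track an EMA of each prompt's success rate and center rewards on
    that anchor instead of only the current group's mean."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.anchor = {}   # prompt_id -> EMA of that prompt's success rate

    def __call__(self, prompt_id, rewards):
        r = np.asarray(rewards, dtype=float)
        prev = self.anchor.get(prompt_id, r.mean())
        self.anchor[prompt_id] = self.beta * prev + (1 - self.beta) * r.mean()
        centered = r - self.anchor[prompt_id]   # history-aware baseline
        return centered / (r.std() + 1e-8)

adv_fn = DifficultyAnchoredAdvantage()
print(adv_fn("prob_42", [1, 0, 0, 1]))   # first call: anchor = group mean
print(adv_fn("prob_42", [1, 1, 1, 0]))   # later: anchor lags the group mean
```

The point of an anchor like this is that a prompt's baseline reflects its history rather than one noisy group draw.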
arXiv Detail & Related papers (2026-01-13T13:03:15Z)
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
- Soft Adaptive Policy Optimization [67.61886077470528]
Reinforcement learning plays an increasingly important role in enhancing the reasoning capabilities of large language models. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate training instability via hard clipping. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate.
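Hard clipping zeroes the gradient outside the trust region; a smooth gate fades it out instead. A minimal sketch of the contrast, assuming a sigmoid gate with temperature tau (the specific gate SAPO uses may differ; tau and soft_gate are illustrative names):

```python
import torch

def ppo_hard_clip(ratio, adv, eps=0.2):
    """Standard PPO/GRPO surrogate: hard-clipped importance ratio."""
    return torch.minimum(ratio * adv,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

def soft_gate(ratio, adv, eps=0.2, tau=0.05):
    """Assumed smooth alternative: a temperature-controlled sigmoid gate
    that fades the update out as the ratio leaves the trust region,
    instead of cutting the gradient off abruptly."""
    gate = torch.sigmoid((eps - (ratio - 1).abs()) / tau)
    return gate * ratio * adv

ratio = torch.linspace(0.6, 1.4, 5)
adv = torch.ones_like(ratio)
print(ppo_hard_clip(ratio, adv))   # capped at 1+eps for positive advantages
print(soft_gate(ratio, adv))       # smooth roll-off outside the trust region
```

The soft version keeps a small, smoothly decaying gradient for off-trust-region samples rather than discarding them outright.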
arXiv Detail & Related papers (2025-11-25T14:25:19Z)
- NGRPO: Negative-enhanced Group Relative Policy Optimization [8.641009168869195]
A representative RLVR algorithm, GRPO, suffers from a critical limitation when all responses within a group are either entirely correct or entirely incorrect. This is particularly problematic for homogeneously incorrect groups, where GRPO's advantage function yields a value of zero. We propose NGRPO, an algorithm designed to convert homogeneous errors into robust learning signals.
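The failure mode is easy to reproduce in a few lines; the repair shown, appending a virtual maximum-reward sample to an all-incorrect group so the real responses receive a nonzero negative advantage, is only a guess at the direction NGRPO takes, not its actual algorithm:

```python
import numpy as np

def grpo_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    return np.zeros_like(r) if std == 0 else (r - r.mean()) / std

# Failure mode: an all-wrong group is zero-variance, so every advantage
# vanishes and the group contributes no learning signal at all.
print(grpo_advantages([0, 0, 0, 0]))            # -> [0. 0. 0. 0.]

def negative_enhanced_advantages(rewards, r_max=1.0):
    """Assumed illustration: add a hypothetical maximum-reward sample to
    a homogeneously incorrect group, so the real (incorrect) responses
    are pushed down relative to it."""
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0 and r.max() < r_max:        # all-incorrect group
        aug = np.append(r, r_max)               # virtual correct response
        return (r - aug.mean()) / (aug.std() + 1e-8)
    return grpo_advantages(r)

print(negative_enhanced_advantages([0, 0, 0, 0]))   # -> nonzero negatives
```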
arXiv Detail & Related papers (2025-09-23T09:38:10Z)
- Geometric-Mean Policy Optimization [122.95205388291987]
We propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of Group Relative Policy Optimization (GRPO). Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark.
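The stabilizing effect of the geometric mean is visible in a short sketch: one outlier token dominates an arithmetic mean but is damped in log-space. This assumes per-token importance ratios stand in for the token-level terms; clipping and other GMPO details are omitted:

```python
import torch

def arithmetic_mean_objective(token_ratios, adv):
    """GRPO-style: arithmetic mean of per-token surrogate terms.
    A single extreme ratio can dominate the sequence objective."""
    return (token_ratios * adv).mean()

def geometric_mean_objective(token_ratios, adv):
    """GMPO-style (sketch): geometric mean of the per-token ratios,
    computed stably as exp(mean(log ratio)), scaled by the advantage."""
    log_geo = token_ratios.clamp_min(1e-8).log().mean()
    return log_geo.exp() * adv

ratios = torch.tensor([1.0, 1.1, 0.9, 8.0])        # one outlier token
print(arithmetic_mean_objective(ratios, adv=1.0))  # dragged up by outlier
print(geometric_mean_objective(ratios, adv=1.0))   # outlier damped
```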
arXiv Detail & Related papers (2025-07-28T09:54:05Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO [37.07375927420007]
Group Relative Policy Optimization has shown impressive success as a PPO-style reinforcement learning algorithm with group-normalized rewards. In this paper, we explore GRPO and identify two problems that degrade effective learning. We propose DeepVideo-R1, a video large language model trained with Reg-GRPO and difficulty-aware data augmentation.
arXiv Detail & Related papers (2025-06-09T06:15:54Z)
- Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. We investigate whether current reinforcement learning algorithms merely sharpen the base model's distribution around problems it can already solve. We show that an unlikeliness reward mitigates rank bias and improves pass@N across a large range of N in both synthetic and real theorem-proving settings.
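The summary names the mechanism but not its form; a minimal sketch, assuming the bonus is a rank-based term that pays correct samples more the less likely they were under the policy (alpha and the rank-based bonus are illustrative choices, not the paper's formula):

```python
import numpy as np

def unlikeliness_shaped_rewards(correct, logprobs, alpha=0.5):
    """Sketch of an 'unlikeliness reward' (assumed form): among correct
    samples in a group, add a bonus growing as the sample's likelihood
    under the current policy falls, countering RL's tendency to only
    sharpen already-probable solutions."""
    correct = np.asarray(correct, dtype=float)
    lp = np.asarray(logprobs, dtype=float)
    # Rank samples from most likely (rank 0) to least likely (rank 1):
    order = lp.argsort()[::-1]
    rank = np.empty_like(lp)
    rank[order] = np.linspace(0.0, 1.0, len(lp))
    return correct * (1.0 + alpha * rank)   # bonus only for correct ones

# Two correct answers; the less likely one earns the larger reward.
print(unlikeliness_shaped_rewards([1, 1, 0, 0], [-5.0, -40.0, -3.0, -60.0]))
```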
arXiv Detail & Related papers (2025-06-03T01:15:15Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
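The summary does not spell the baseline out. One classical candidate is the variance-minimizing baseline, which weights each trajectory's reward by the squared norm of its score function; the sketch below uses sequence length as a proxy for that squared norm, a common assumption for token-factorized policies. Whether this matches OPO's exact estimator is not guaranteed:

```python
import numpy as np

def optimal_baseline(rewards, lengths):
    """Length-weighted mean reward as a stand-in for the classic
    variance-optimal baseline: b* = sum(l_i * r_i) / sum(l_i),
    with length l_i proxying the squared score-function norm."""
    r = np.asarray(rewards, dtype=float)
    l = np.asarray(lengths, dtype=float)
    return (l * r).sum() / l.sum()

rewards = [1.0, 0.0, 1.0]
lengths = [900, 120, 300]        # long rollouts dominate the baseline
b = optimal_baseline(rewards, lengths)
print(b, np.asarray(rewards) - b)  # baseline and resulting advantages
```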
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
- On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z)
- Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO [21.369307672809366]
Group Relative Policy Optimization (GRPO) stalls when all sampled responses in a group are incorrect. We propose a framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We empirically validate our approach, showing improved performance across various model sizes.
arXiv Detail & Related papers (2025-05-16T18:02:05Z)
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a REINFORCE-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
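The filtering rule itself is stated outright in the summary, so the sketch below only packages it; the surrounding names and data layout are assumptions:

```python
def reinforce_rej_filter(groups):
    """Keep only prompts with mixed outcomes: groups whose sampled
    responses are homogeneously right or wrong give degenerate
    policy-gradient signal and are dropped, per the rule above.
    `groups` maps prompt -> list of binary (0/1) rewards."""
    return {prompt: rewards
            for prompt, rewards in groups.items()
            if 0 < sum(rewards) < len(rewards)}

batch = {
    "p1": [1, 1, 1, 1],   # entirely correct   -> rejected
    "p2": [0, 0, 0, 0],   # entirely incorrect -> rejected
    "p3": [1, 0, 1, 0],   # informative        -> kept
}
print(reinforce_rej_filter(batch))   # {'p3': [1, 0, 1, 0]}
```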
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
- Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach [2.8626097661711394]
Reinforcement Learning from Human Feedback has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. We propose a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation.
arXiv Detail & Related papers (2025-03-26T05:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.