Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models
- URL: http://arxiv.org/abs/2506.04746v1
- Date: Thu, 05 Jun 2025 08:27:34 GMT
- Title: Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models
- Authors: Fei Ding, Baiqiao Wang, Zijian Zeng, Youwei Wang
- Abstract summary: We propose MGRPO (Multi-layer GRPO) to foster reasoning and self-correction abilities in LLMs. MGRPO significantly outperforms standard GRPO on mathematical reasoning benchmarks.
- Score: 3.0763741715155666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training stability. To address these challenges, we propose MGRPO (Multi-layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process. This second layer is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit process-level supervision by rewarding successful error correction, without requiring an explicit, densely-annotated reward model. Experimental results on several mathematical reasoning benchmarks demonstrate that MGRPO significantly outperforms standard GRPO, achieving superior performance by fostering both reasoning and self-correction abilities.
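The two-layer mechanism described in the abstract can be illustrated with a minimal sketch. Assumptions: `generate`, `correct`, and `reward_fn` are hypothetical stand-ins for the policy's first-pass sampler, its correction pass conditioned on (query, draft), and an outcome-based verifier; the clipped policy-gradient update that consumes the advantages is omitted.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO's core normalization: each reward is scored relative to its
    sampled group, A_i = (r_i - mean(r)) / std(r)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero for uniform groups
    return [(r - mean) / std for r in rewards]

def mgrpo_step(query, generate, correct, reward_fn, group_size=4):
    """One MGRPO rollout step. Layer 1 samples a group of initial answers
    (standard GRPO); layer 2 conditions on each (query, draft) pair and is
    rewarded for successful correction, giving implicit process-level
    supervision without a dense reward model."""
    drafts = [generate(query) for _ in range(group_size)]
    adv1 = group_relative_advantages([reward_fn(query, d) for d in drafts])
    revisions = [correct(query, d) for d in drafts]
    adv2 = group_relative_advantages([reward_fn(query, r) for r in revisions])
    return adv1, adv2

# Toy deterministic demo: two of four drafts are wrong; the correction
# pass fixes everything, so layer-2 advantages collapse to zero.
pool = iter(["2+2=5", "2+2=4", "2+2=3", "2+2=4"])
adv1, adv2 = mgrpo_step(
    "2+2=?",
    generate=lambda q: next(pool),
    correct=lambda q, d: "2+2=4",
    reward_fn=lambda q, a: 1.0 if a.endswith("=4") else 0.0,
)
```

With these stand-ins, `adv1` is `[-1.0, 1.0, -1.0, 1.0]` and `adv2` is all zeros: once every revision is correct, the group carries no gradient signal, which is why the paper's reward design credits the act of correction rather than only final correctness.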
Related papers
- WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning [67.45237332694025]
Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
arXiv Detail & Related papers (2026-02-19T02:43:35Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z) - Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z) - GRPO is Secretly a Process Reward Model [5.637496960655903]
We show that the GRPO RL algorithm induces a non-trivial process reward model under real-world conditions. We propose a simple modification to the algorithm to mitigate this defect. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO.
arXiv Detail & Related papers (2025-09-25T13:40:36Z) - Group Causal Policy Optimization for Post-Training Large Language Models [10.791474908144703]
Group Relative Policy Optimization (GRPO) treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. We propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components. GCPO consistently surpasses existing methods, including GRPO, across multiple reasoning benchmarks.
arXiv Detail & Related papers (2025-08-07T14:17:28Z) - EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity [7.818698554631196]
The Group Relative Policy Optimization (GRPO) algorithm relies on sparse reward rules, leading to the advantage collapse problem. We propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse.
arXiv Detail & Related papers (2025-07-29T14:23:58Z) - GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z) - Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
We introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Experiments confirm that incorporating the unlikeliness reward significantly improves pass@$N$ across a large range of $N$. Applying our revised recipe to Lean, we achieve competitive performance with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark.
arXiv Detail & Related papers (2025-06-03T01:15:15Z) - VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z) - On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z) - DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [55.06360285372418]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent question-level difficulty bias. We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.
arXiv Detail & Related papers (2025-05-18T11:08:32Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models [0.17265013728931003]
GRPO-LEAD is a suite of novel enhancements tailored for mathematical reasoning. It introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems.
arXiv Detail & Related papers (2025-04-13T19:07:45Z) - Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models [76.52997424694767]
Keyphrase Generation (KPG) is a longstanding task in NLP with widespread applications.
Seq2seq pre-trained language models (PLMs) have ushered in a transformative era for KPG, yielding promising performance improvements.
This paper undertakes a systematic analysis of the influence of model selection and decoding strategies on PLM-based KPG.
arXiv Detail & Related papers (2023-10-10T07:34:45Z) - AGRO: Adversarial Discovery of Error-prone groups for Robust Optimization [109.91265884632239]
Group distributionally robust optimization (G-DRO) can minimize the worst-case loss over a set of groups pre-defined over the training data.
We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization.
AGRO results in 8% higher model performance on average on known worst-groups, compared to prior group discovery approaches.
arXiv Detail & Related papers (2022-12-02T00:57:03Z) - Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables [20.078557260741988]
Group Distributionally Robust Markov Decision Process (GDR-MDP) is a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model.
GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups.
We then develop deep RL algorithms for GDR-MDP for both value-based and policy-based RL methods.
arXiv Detail & Related papers (2022-10-21T21:34:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.