Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
- URL: http://arxiv.org/abs/2602.05548v2
- Date: Thu, 12 Feb 2026 06:56:38 GMT
- Title: Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
- Authors: Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
- Abstract summary: Group Relative Advantage Estimation (GRAE) carries an implicit advantage symmetry. We propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
- Score: 19.404286148401795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition that prioritizes simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
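To make the symmetry concrete, here is a minimal sketch of GRPO-style group-relative advantage estimation alongside a hypothetical asymmetric variant in the spirit of A-GRAE. The function names and the fixed `alpha` suppression factor are illustrative assumptions, not the paper's actual formulation, which modulates these quantities dynamically.

```python
import numpy as np

def grae_advantages(rewards):
    """Group Relative Advantage Estimation as used in GRPO: z-score each
    trajectory's verifiable reward within its sampled group, so correct
    and incorrect trajectories receive exactly symmetric weights."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:              # all-correct or all-wrong group: no signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

def asymmetric_grae_advantages(rewards, alpha=0.5):
    """Hypothetical asymmetric variant: suppress the positive advantages
    (correct trajectories) by a factor `alpha`, leaving the negative side
    untouched. `alpha` is an assumed fixed knob standing in for A-GRAE's
    dynamic modulation."""
    adv = grae_advantages(rewards)
    adv[adv > 0] *= alpha
    return adv

# A group of 4 sampled answers to one prompt, 2 verified correct:
print(grae_advantages([1, 1, 0, 0]))             # symmetric:  [ 1.   1.  -1.  -1. ]
print(asymmetric_grae_advantages([1, 1, 0, 0]))  # suppressed: [ 0.5  0.5 -1.  -1. ]
```

Per the abstract, the exactly balanced positive and negative weights are what leave unsampled action logits unchanged; damping only the positive side breaks that balance, which the paper's controlled experiments link to better exploration of novel correct solutions.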
Related papers
- MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning [16.012761588513026]
Reinforcement Learning with Verifiable Rewards (RLVR) algorithms rely on rigid, uniform, and symmetric trust region mechanisms.<n>We propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions.<n> MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence.
arXiv Detail & Related papers (2026-02-19T17:05:20Z)
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization (a toy sketch of this leave-one-out shaping appears after this list). Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z)
- TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization [32.17940023097263]
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. Current reinforcement learning (RL) frameworks for search-augmented reasoning rely on sparse outcome-level rewards. We propose Turn-level Stage-aware Policy Optimization (TSPO) to address this problem.
arXiv Detail & Related papers (2026-01-30T09:58:45Z)
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
- Reinforcement Learning Using Known Invariances [54.91261509214309]
This paper develops a theoretical framework for incorporating known group symmetries into kernel-based reinforcement learning. We show that symmetry-aware RL algorithms achieve significantly better performance than their standard kernel counterparts.
arXiv Detail & Related papers (2025-11-05T13:56:14Z)
- Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations [57.179679246370114]
We identify the distribution of random perturbations that minimizes the estimator's variance as the perturbation stepsize tends to zero. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length (see the two-point estimator sketch after this list).
arXiv Detail & Related papers (2025-10-22T19:06:39Z)
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Joint Asymmetric Loss for Learning with Noisy Labels [95.14298444251044]
Symmetric losses usually suffer from underfitting due to their overly strict constraints. Within the Active Passive Loss (APL) framework, symmetric losses have been successfully extended, yielding advanced robust loss functions. We introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL).
arXiv Detail & Related papers (2025-07-23T16:57:43Z)
- Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- First-principles construction of symmetry-informed quantum metrologies [0.0]
We develop a class of measurement strategies for quantities isomorphic to location parameters.
The resulting framework admits any parameter range, prior information, or state.
It reduces the search for good strategies to identifying which symmetry leaves a state of maximum ignorance invariant.
arXiv Detail & Related papers (2024-02-26T09:06:37Z)
- Smoothness Adaptive Hypothesis Transfer Learning [8.557392136621894]
Smoothness Adaptive Transfer Learning (SATL) is a two-phase kernel ridge regression (KRR)-based algorithm.
We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality.
We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor.
arXiv Detail & Related papers (2024-02-22T21:02:19Z)
- Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching [26.994954303270575]
Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification.
While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances.
This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations (a toy sketch of this intervention appears after this list).
arXiv Detail & Related papers (2023-06-13T17:55:39Z)
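As promised above, here is a toy sketch of the leave-one-out advantage shaping described in the SetPO entry. The RBF kernel and the negative mean pairwise similarity objective are assumed choices, since the abstract specifies only "kernelized similarity"; the trajectory embeddings are stand-ins.

```python
import numpy as np

def rbf_kernel(E, gamma=1.0):
    """Pairwise RBF similarities between trajectory embeddings E (n x d)."""
    sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def loo_diversity_bonus(E, gamma=1.0):
    """Leave-one-out marginal contribution of each trajectory to a
    set-level diversity objective (here: negative mean off-diagonal
    kernel similarity). The result can be added to per-trajectory
    advantages as a plug-in shaping term."""
    K = rbf_kernel(E, gamma)
    n = len(E)
    def diversity(idx):
        sub = K[np.ix_(idx, idx)]
        off = sub.sum() - np.trace(sub)      # exclude self-similarity
        m = len(idx)
        return -off / (m * (m - 1)) if m > 1 else 0.0
    full = diversity(list(range(n)))
    return np.array([full - diversity([j for j in range(n) if j != i])
                     for i in range(n)])

# Toy example: 4 trajectory embeddings; near-duplicates get a negative
# bonus (penalized), the distinctive one gets a positive bonus.
E = np.array([[0., 0.], [0.1, 0.], [3., 3.], [0.05, 0.05]])
print(loo_diversity_bonus(E))
```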
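The two-point estimator referenced in the zeroth-order optimization entry has a standard textbook form, sketched below. The spherically uniform direction and fixed stepsize `mu` are illustrative defaults, not the minimum-variance, directionally aligned perturbations the paper derives.

```python
import numpy as np

def two_point_gradient(f, x, mu=1e-3, rng=None):
    """Classic two-point zeroth-order gradient estimator:
        g_hat = d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u
    with u drawn uniformly from the unit sphere. Its expectation
    approaches the true gradient as the stepsize mu tends to zero."""
    if rng is None:
        rng = np.random.default_rng()
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                            # uniform unit direction
    fd = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)   # directional derivative
    return d * fd * u

# Sanity check on f(x) = ||x||^2, whose true gradient is 2x:
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
est = np.mean([two_point_gradient(f, x) for _ in range(20000)], axis=0)
print(est)   # approximately [ 2., -4.,  1. ]
```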
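The ReLU intervention from the supervised contrastive (SCL) entry amounts to a one-line change in what the loss sees. The sketch below pairs a common single-view SupCon loss with ReLU-passed final-layer features; the batch and feature dimensions are toy stand-ins, and the loss formulation is the standard one rather than necessarily the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Single-view supervised contrastive (SupCon) loss.
    Positives for anchor i are all other samples sharing its label."""
    z = F.normalize(features, dim=1)                 # unit-norm features
    sim = z @ z.T / temperature                      # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))  # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return (per_anchor / pos_mask.sum(1).clamp(min=1)).mean()

# The intervention described in the entry: pass final-layer features
# through a ReLU before the loss; per the abstract, this restores the
# symmetric geometry of learned representations under class imbalance.
feats = torch.randn(8, 16, requires_grad=True)   # toy final-layer features
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])  # imbalanced toy batch
loss = supcon_loss(F.relu(feats), labels)        # ReLU at the final layer
loss.backward()
print(float(loss))
```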