DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
- URL: http://arxiv.org/abs/2603.01106v1
- Date: Sun, 01 Mar 2026 13:47:35 GMT
- Title: DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
- Authors: Haowen Gao, Zhenyu Zhang, Liang Pang, Fangda Guo, Hongjian Dou, Guannan Lv, Shaoguo Liu, Tingting Gao, Huawei Shen, Xueqi Cheng
- Abstract summary: Reinforcement learning with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). GRPO enables long-chain reasoning without a critic, but it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. We propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective.
- Score: 83.64031699341862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
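The abstract describes the mechanism only at a high level (assess difficulty, sample variants, combine local and global group advantages), so the sketch below is a minimal illustrative reading rather than the paper's method: the pass-rate-style difficulty score, the quadratic weight `w`, and the local/global mixing are all assumptions. What it does show is the core motivation: when every rollout in a group gets the same reward, the plain GRPO advantage is zero, while a pooled, difficulty-weighted advantage still carries a signal.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Plain GRPO advantage: z-score of each reward within its rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def diva_style_advantages(local_rewards, pooled_rewards, difficulty, eps=1e-6):
    """Difficulty-adaptive variant advantage (illustrative only).

    local_rewards  : rewards for the rollouts of one problem variant
    pooled_rewards : rewards pooled over all variants of the problem (global group)
    difficulty     : estimated difficulty in [0, 1] (e.g. 1 - pass rate);
                     the quadratic weight below is an assumption, not the
                     paper's formula.
    """
    local = np.asarray(local_rewards, dtype=float)
    pooled = np.asarray(pooled_rewards, dtype=float)
    local_adv = group_advantages(local, eps)
    # Global advantage: normalize against the pooled group, which stays
    # informative even when the local group's rewards are all identical.
    global_adv = (local - pooled.mean()) / (pooled.std() + eps)
    # Weight peaks at medium difficulty, where local groups are most varied,
    # and shifts toward the global signal on very easy or very hard problems.
    w = 4.0 * difficulty * (1.0 - difficulty)
    return w * local_adv + (1.0 - w) * global_adv

# A hard problem: every local rollout failed, so plain GRPO gives zero
# advantage, while easier variants in the pooled group restore a signal.
local = [0.0, 0.0, 0.0, 0.0]
pooled = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print(group_advantages(local))                               # all zeros
print(diva_style_advantages(local, pooled, difficulty=0.9))  # non-zero
```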
Related papers
- Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization [38.26061472669552]
We propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares normalization within each group. Our approach preserves intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
arXiv Detail & Related papers (2026-02-25T09:52:50Z)
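A minimal sketch of the re-grouping idea in the Durian summary above, assuming pass-rate-style difficulty estimates in [0, 1] and three buckets; the bucketing scheme and the within-bucket z-score are my reading of the abstract, not the paper's exact procedure.

```python
import numpy as np

def durian_style_advantages(rewards, difficulties, n_bins=3, eps=1e-6):
    """Re-group rollouts into difficulty buckets and normalize within each.

    rewards      : per-rollout rewards, shape (N,)
    difficulties : per-rollout difficulty estimates in [0, 1], shape (N,)
    """
    r = np.asarray(rewards, dtype=float)
    d = np.asarray(difficulties, dtype=float)
    # Bucket 0 = easy ... bucket n_bins-1 = hard.
    buckets = np.minimum((d * n_bins).astype(int), n_bins - 1)
    adv = np.zeros_like(r)
    for b in range(n_bins):
        mask = buckets == b
        if mask.any():
            # Normalize only against peers of similar difficulty, so
            # extreme (all-easy or all-hard) samples cannot skew the rest.
            adv[mask] = (r[mask] - r[mask].mean()) / (r[mask].std() + eps)
    return adv

print(durian_style_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    difficulties=[0.1, 0.15, 0.5, 0.55, 0.9, 0.95],
))  # roughly [+1, -1, +1, -1, -1, +1]: each pair is judged within its bucket
```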
- WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning [67.45237332694025]
Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
arXiv Detail & Related papers (2026-02-19T02:43:35Z)
- Transform-Augmented GRPO Improves Pass@k [50.3707071191733]
Group Relative Policy Optimization (GRPO) was designed to improve reasoning, but it worsens this situation through two failure modes. We propose TA-GRPO (Transform-Augmented GRPO), which generates semantically equivalent transformed variants of each question. This pooled computation ensures mixed rewards even when the original question is too easy or too hard, while training on diverse phrasings promotes multiple solution strategies.
arXiv Detail & Related papers (2026-01-30T02:43:29Z)
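The pooling step the TA-GRPO summary refers to can be sketched directly: compute advantage statistics over the rollouts of all transformed variants of a question instead of per variant. The function below is a hedged illustration; the variant-generation step and any per-variant weighting the paper may use are omitted.

```python
import numpy as np

def pooled_variant_advantages(rewards_per_variant, eps=1e-6):
    """Compute advantages against statistics pooled over all transformed
    variants of one question.

    rewards_per_variant : list of per-variant reward lists
    """
    pooled = np.concatenate([np.asarray(r, dtype=float) for r in rewards_per_variant])
    mu, sigma = pooled.mean(), pooled.std() + eps
    # Even if one variant's rollouts are uniformly right or wrong, the
    # pooled baseline keeps its advantages from collapsing to zero.
    return [(np.asarray(r, dtype=float) - mu) / sigma for r in rewards_per_variant]

# The original phrasing is always solved; a harder rephrasing almost never is.
for adv in pooled_variant_advantages([[1, 1, 1, 1], [0, 0, 0, 1]]):
    print(np.round(adv, 3))
```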
- DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations [22.299736215070343]
Multimodal Large Language Models (MLLMs) tend to overemphasize easily distinguishable preference pairs. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process.
arXiv Detail & Related papers (2026-01-02T09:41:54Z)
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
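Mechanism (1) of DaGRPO, as summarized above, amounts to dropping pairs whose fine-grained scores are nearly indistinguishable. A toy sketch, in which the scoring inputs and the threshold value are assumptions:

```python
import numpy as np

def distinctiveness_mask(scores_a, scores_b, threshold=0.1):
    """Keep only sample pairs whose fine-grained scores differ by at least
    `threshold`; near-ties carry no reliable preference signal and are
    masked out of the gradient. The threshold value is an assumption.
    """
    gap = np.abs(np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float))
    return gap >= threshold

print(distinctiveness_mask([0.9, 0.55, 0.80], [0.2, 0.50, 0.75]))
# [ True False False] -> only the clearly distinguishable pair contributes
```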
- Soft Adaptive Policy Optimization [67.61886077470528]
Reinforcement learning plays an increasingly important role in enhancing the reasoning capabilities of large language models. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate.
arXiv Detail & Related papers (2025-11-25T14:25:19Z)
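To make the contrast with hard clipping concrete, here is a toy comparison; the sigmoid gate and the temperature `tau` below are assumptions standing in for SAPO's actual gate, which the paper defines.

```python
import numpy as np

def hard_clip(ratio, eps=0.2):
    """PPO/GRPO-style hard clipping of the importance ratio."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def soft_gate(ratio, eps=0.2, tau=0.05):
    """A smooth, temperature-controlled gate (sigmoid form and `tau` are
    assumptions, not SAPO's definition)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Attenuate the ratio gradually outside [1 - eps, 1 + eps] instead of
    # flattening it (and its gradient) exactly at the clip boundary.
    return ratio * sigmoid((ratio - (1.0 - eps)) / tau) * sigmoid(((1.0 + eps) - ratio) / tau)

r = np.linspace(0.5, 1.5, 5)
print(hard_clip(r))               # [0.8 0.8 1.  1.2 1.2]
print(np.round(soft_gate(r), 3))  # decays smoothly near and beyond the bounds
```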
- FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z)
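On a small discrete toy problem the objective in the FlowRL summary can be written out exactly, since the partition function is a finite sum; FlowRL itself learns the partition function for sequence-level rewards. The `beta` value and the example numbers below are illustrative.

```python
import numpy as np

def reverse_kl_to_reward_target(policy_probs, rewards, beta=1.0):
    """KL(pi || p*) where p*(y) is proportional to exp(r(y) / beta) on a
    small discrete action set. The partition function is computed exactly
    here; FlowRL instead learns it.
    """
    pi = np.asarray(policy_probs, dtype=float)
    target = np.exp(np.asarray(rewards, dtype=float) / beta)
    target /= target.sum()                    # exact partition function Z
    return float(np.sum(pi * np.log(pi / target)))

rewards = [1.0, 0.5, 0.0]
print(reverse_kl_to_reward_target([0.90, 0.05, 0.05], rewards))  # mode-collapsed: high KL
print(reverse_kl_to_reward_target([0.51, 0.31, 0.18], rewards))  # matches target: low KL
```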
- EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity [7.818698554631196]
The Group Relative Policy Optimization (GRPO) algorithm relies on sparse reward rules, leading to the advantage collapse problem. We propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse.
arXiv Detail & Related papers (2025-07-29T14:23:58Z)
- DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning. DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z)
- GVPO: Group Variance Policy Optimization for Large Language Model Post-Training [19.005045649097987]
Group Variance Policy Optimization (GVPO) incorporates the analytical solution to the KL-constrained reward objective directly into its weights. GVPO offers two key advantages: it guarantees a unique optimal solution, exactly the optimum of the KL-constrained reward objective, and it supports flexible sampling distributions. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
arXiv Detail & Related papers (2025-04-28T09:02:24Z)