Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning
- URL: http://arxiv.org/abs/2601.19280v1
- Date: Tue, 27 Jan 2026 07:10:41 GMT
- Title: Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning
- Authors: Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu,
- Abstract summary: Multi-Ad Distributionally Robust Optimization (GDRO) is an optimization-first framework that moves beyond uniform reasoning.<n>We propose two independent GDRO games for post-training: Prompt-GDRO, which employs an EMA-debiased multiplicative-weight bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral)<n>We validate our framework on the DAPO 14.1k dataset using Q
- Score: 45.86058898829962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - GDRO: Group-level Reward Post-training Suitable for Diffusion Models [55.948229011478304]
Group-level rewards successfully align the model with the targeted reward.<n>Group-level Direct Reward Optimization (GDRO) is a new post-training paradigm for group-level reward alignment.<n>GDRO supports full offline training that saves the large time cost for image rollout sampling.<n>It is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtainity.
arXiv Detail & Related papers (2026-01-05T11:47:18Z) - DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO)<n>DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks.<n>In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z) - Adaptive Sample-Level Framework Motivated by Distributionally Robust Optimization with Variance-Based Radius Assignment for Enhanced Neural Network Generalization Under Distribution Shift [0.8101875496469488]
Distribution shifts and minority subpopulations frequently undermine the reliability of deep neural networks trained using Empirical Risk Minimization (ERM)<n>We propose a variance-driven, sample-level DRO framework that automatically identifies high-risk training samples and assigns a personalized robustness budget to each based on its online loss variance.
arXiv Detail & Related papers (2025-11-04T10:20:21Z) - DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs)<n>We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.<n>DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z) - Modeling the Q-Diversity in a Min-max Play Game for Robust Optimization [61.39201891894024]
Group distributionally robust optimization (group DRO) can minimize the worst-case loss over pre-defined groups.
We reformulate the group DRO framework by proposing Q-Diversity.
Characterized by an interactive training mode, Q-Diversity relaxes the group identification from annotation into direct parameterization.
arXiv Detail & Related papers (2023-05-20T07:02:27Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.