VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
- URL: http://arxiv.org/abs/2511.18902v1
- Date: Mon, 24 Nov 2025 08:59:54 GMT
- Title: VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
- Authors: Zengjie Hu, Jiantao Qiu, Tianyi Bai, Haojin Yang, Binhang Yuan, Qi Jing, Conghui He, Wentao Zhang
- Abstract summary: Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models. They suffer from a critical *gradient vanishing* problem when all responses within a group receive identical rewards. We propose **VADE**, a **V**ariance-**A**ware **D**ynamic sampling framework via online sample-level difficulty **E**stimation.
- Score: 38.782188833641676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical *gradient vanishing* problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose **VADE**, a **V**ariance-**A**ware **D**ynamic sampling framework via online sample-level difficulty **E**stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three-component design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serve as a plug-and-play component that integrates seamlessly into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
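To make the three components concrete, here is a minimal Python sketch of the idea as the abstract describes it: each training sample keeps a Beta posterior over its correctness probability, a Thompson draw from that posterior scores how informative a group rollout on the sample would be, and old evidence is decayed so estimates track the evolving policy. The class and function names, the Bernoulli-variance scoring rule, and the single decay factor (the paper uses a two-scale prior decay) are illustrative assumptions, not the authors' implementation.

```python
import random

class BetaDifficulty:
    """Per-sample Beta posterior over the policy's correctness probability."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha = alpha  # pseudo-count of correct rollouts
        self.beta = beta    # pseudo-count of incorrect rollouts

    def update(self, n_correct: int, n_total: int, decay: float = 0.99):
        # Decay old evidence so the estimate tracks the evolving policy
        # (a single factor stands in for the paper's two-scale prior decay),
        # then fold in the fresh rollout outcomes.
        self.alpha = self.alpha * decay + n_correct
        self.beta = self.beta * decay + (n_total - n_correct)

    def thompson_score(self) -> float:
        # Thompson step: draw a plausible correctness probability p from the
        # posterior, then score the sample by the Bernoulli variance p(1-p),
        # an assumed proxy for the information gain of rolling it out.
        p = random.betavariate(self.alpha, self.beta)
        return p * (1.0 - p)


def select_batch(estimators: dict, batch_size: int) -> list:
    # Rank samples by their Thompson draw and keep the most informative ones.
    ranked = sorted(estimators, key=lambda s: estimators[s].thompson_score(),
                    reverse=True)
    return ranked[:batch_size]
```

Scoring by p(1-p) favors samples whose correctness probability looks closest to 0.5, which are exactly the samples least likely to yield a degenerate all-correct or all-wrong group under group-relative advantage estimation.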
Related papers
- GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning [14.111530312590531]
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI). We propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), which incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases.
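The equal-right advantage is the most self-contained of these mechanisms, so here is a hedged sketch of one way to realize it: the sign of the advantage comes only from whether the proof succeeded, while auxiliary rewards only modulate the magnitude. The function name, clamp, and scaling below are assumptions for illustration, not GDEPO's exact formulation.

```python
def equal_right_advantage(proof_valid: bool, aux_reward: float, scale: float = 1.0) -> float:
    # Sign is decided solely by proof validity, so auxiliary shaping rewards
    # can never flip a failed proof into a positive update (or vice versa).
    sign = 1.0 if proof_valid else -1.0
    # Auxiliary rewards only rescale the update; the clamp is an assumption.
    magnitude = scale * (1.0 + max(0.0, min(aux_reward, 1.0)))
    return sign * magnitude
```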
arXiv Detail & Related papers (2026-01-11T07:34:41Z)
- Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View [10.95044674432639]
We propose two difficulty-aware sampling strategies for multimodal reasoning: Progressive Image Semantic Masking (PISM), which quantifies sample hardness through systematic image degradation, and Cross-Modality Attention Balance (CMAB), which assesses cross-modal interaction complexity.
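As a rough illustration of difficulty scoring by image degradation, the sketch below masks an image at increasing fractions and records the first level at which the model's answer breaks. `answer_fn` and `mask_fn` are hypothetical interfaces, and the mapping from this sensitivity to PISM's hardness score may differ from the paper.

```python
def masking_sensitivity(image, question, answer_fn, mask_fn,
                        levels=(0.25, 0.5, 0.75, 1.0)):
    # answer_fn(img, q) -> bool and mask_fn(img, fraction) -> img are
    # assumed interfaces standing in for the model and the masking op.
    for level in levels:
        if not answer_fn(mask_fn(image, level), question):
            return level  # smallest masking fraction that breaks the answer
    return 1.0  # answer survives even full masking
```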
arXiv Detail & Related papers (2025-11-10T05:31:59Z)
- Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training [47.26632817047513]
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates. We propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process.
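One plausible reading of that interleaving, sketched below under an assumed interface (`rollout_fn(prompt, n)` returning binary rewards): stop allocating rollouts to a prompt once its group already contains both correct and incorrect responses, and keep sampling the still-degenerate prompts up to a round budget. The elimination rule here is an assumption, not the paper's exact criterion.

```python
def adaptive_rollouts(prompts, rollout_fn, max_rounds=4, group_size=4):
    # rollout_fn(prompt, n) -> list of n binary rewards (assumed interface).
    rewards = {p: [] for p in prompts}
    active = set(prompts)
    for _ in range(max_rounds):
        if not active:
            break
        for p in list(active):
            rewards[p].extend(rollout_fn(p, group_size))
            # Eliminate the prompt once its group mixes correct and incorrect
            # responses: its advantage estimate is no longer degenerate.
            if len(set(rewards[p])) > 1:
                active.discard(p)
    return rewards
```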
arXiv Detail & Related papers (2025-10-06T16:34:09Z)
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by the Variance Promotion Score (VPS). We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
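The abstract does not define VPS, but the selection principle it points at can be sketched as ranking prompts by the variance of their recent rollout rewards. The scoring function below is an assumed stand-in, not MMR1's actual formula.

```python
from statistics import pvariance

def variance_score(rewards):
    # Variance of recent binary rollout rewards: zero for always-right or
    # always-wrong prompts, maximal when the policy succeeds about half the
    # time. An assumed proxy for MMR1's Variance Promotion Score.
    return pvariance(rewards) if len(rewards) > 1 else 0.0

def select_by_variance(history, k):
    # history: sample id -> list of recent rollout rewards for that sample.
    ranked = sorted(history, key=lambda s: variance_score(history[s]), reverse=True)
    return ranked[:k]
```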
arXiv Detail & Related papers (2025-09-25T14:58:29Z)
- Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning [41.83677588934301]
We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
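To illustrate the idea of mining positive steps within negative samples, a minimal sketch: steps of a failed response that the LLM/PRM consensus judges correct still receive positive weight in the policy update. The function, the per-step labels, and the `pos_weight` hyperparameter are illustrative assumptions.

```python
def nsa_step_weights(step_correct, sample_failed, pos_weight=0.5):
    # step_correct: per-step booleans from the consensus judgment.
    # In a successful sample every step keeps full positive weight; in a
    # failed sample, correct steps are still rewarded (at reduced weight)
    # instead of being uniformly penalized.
    if not sample_failed:
        return [1.0] * len(step_correct)
    return [pos_weight if ok else -1.0 for ok in step_correct]
```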
arXiv Detail & Related papers (2025-05-20T14:16:49Z)
- Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for light continual training of LLMs using their original pre-training datasets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
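Instance-Reweighted DRO admits several instantiations; the sketch below uses exponential tilting (a common KL-constrained form, chosen here as an illustrative assumption rather than the paper's exact objective), under which higher-loss, harder instances receive exponentially larger weight in the training objective.

```python
import math

def dro_instance_weights(losses, temperature=1.0):
    # Exponential tilting: weight_i proportional to exp(loss_i / T), then
    # rescaled so the weights average to 1 over the batch. Lower T focuses
    # more sharply on the hardest instances.
    tilted = [math.exp(l / temperature) for l in losses]
    z = sum(tilted)
    n = len(losses)
    return [n * t / z for t in tilted]

# Usage: reweight per-sample losses before averaging into the batch loss.
# batch_loss = sum(w * l for w, l in zip(dro_instance_weights(losses), losses)) / len(losses)
```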
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Rethinking Sampling Strategies for Unsupervised Person Re-identification [59.47536050785886]
We analyze the reasons for the performance differences between various sampling strategies under the same framework and loss function. We propose group sampling, which gathers samples from the same class into groups. Experiments on Market-1501, DukeMTMC-reID and MSMT17 show that group sampling achieves performance comparable to state-of-the-art methods.
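A minimal sketch of the grouping step, with the batching details assumed: indices sharing a (pseudo-)label are bucketed together and emitted as whole groups, so a mini-batch sees several instances of the same identity rather than scattered singletons.

```python
import random
from collections import defaultdict

def group_sampling(labels, group_size):
    # Bucket sample indices by (pseudo-)label, then slice each bucket into
    # fixed-size groups; batches are built from whole groups. Group size and
    # shuffling policy are illustrative assumptions.
    buckets = defaultdict(list)
    for idx, lab in enumerate(labels):
        buckets[lab].append(idx)
    groups = []
    for members in buckets.values():
        random.shuffle(members)
        for i in range(0, len(members), group_size):
            groups.append(members[i:i + group_size])
    random.shuffle(groups)
    return groups
```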
arXiv Detail & Related papers (2021-07-07T05:39:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.