Efficient RLVR Training via Weighted Mutual Information Data Selection
- URL: http://arxiv.org/abs/2603.01907v1
- Date: Mon, 02 Mar 2026 14:25:07 GMT
- Title: Efficient RLVR Training via Weighted Mutual Information Data Selection
- Authors: Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo,
- Abstract summary: Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. We show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection.
- Score: 30.408074538619626
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
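To make the difficulty/evidence decomposition concrete, here is a minimal Beta-Bernoulli sketch. It is not the authors' code: the function names, the initial Beta(2, 2) belief, and the use of posterior variance as the acquisition score are illustrative assumptions, not the paper's exact weighted mutual information objective. Under a Beta(α, β) belief over a datapoint's latent success rate, the posterior variance factors into a difficulty term μ(1-μ) and an evidence term 1/(α+β+1) that shrinks as rollouts accumulate, matching the flavor of the decomposition the abstract describes:

```python
def update_belief(alpha, beta, successes, rollouts):
    """Conjugate Beta-Bernoulli update for one datapoint after a batch of
    verifiable-reward rollouts (`successes` correct out of `rollouts`)."""
    return alpha + successes, beta + (rollouts - successes)

def acquisition_score(alpha, beta):
    """Illustrative score: the variance of Beta(alpha, beta) factors into a
    difficulty term mu*(1-mu), peaked at mu = 0.5, and an evidence term
    1/(alpha+beta+1) that decays as rollouts accumulate. It is computed
    from the mean belief mu, not from noisy sampled outcomes."""
    mu = alpha / (alpha + beta)            # mean belief of the success rate
    difficulty = mu * (1.0 - mu)           # what difficulty-only selection sees
    evidence = 1.0 / (alpha + beta + 1.0)  # epistemic: how little we know
    return difficulty * evidence

# Two datapoints with the same mean difficulty (mu = 0.5): the one backed
# by fewer rollouts carries more epistemic uncertainty and scores higher.
print(acquisition_score(2.0, 2.0))     # 0.05    (few observations)
print(acquisition_score(20.0, 20.0))   # ~0.0061 (many observations)

# Multi-rollout update, as in RLVR: 8 rollouts, 3 verified correct.
alpha, beta = update_belief(2.0, 2.0, successes=3, rollouts=8)
print(acquisition_score(alpha, beta))  # belief sharpened, score drops
```

Note how a difficulty-only heuristic would rank the first two datapoints identically, while the evidence term separates them; this is the limitation of difficulty-only selection the abstract points to.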
Related papers
- SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data [6.4391040754741296]
In this paper, we introduce a Stability-Guided Online Influence Framework (SG-OIF) for approximating training-point influence on test predictions. We show that SG-OIF achieves 91.1% accuracy on the top 1% of prediction samples on CIFAR-10, and a 99.8% AUPR score on MNIST.
arXiv Detail & Related papers (2025-11-21T19:58:54Z)
- Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration [33.02780998281276]
Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models. This study investigates how simply leveraging intrinsic data properties, an almost-free benefit during training, can improve data efficiency for RLVR.
arXiv Detail & Related papers (2025-11-02T04:16:47Z)
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
- Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study [91.78803511141975]
This work focuses on the roles of positive and negative samples in scaling reinforcement learning. We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage (illustrated in the sketch below). We investigate unstable performance across various reasoning models and benchmarks, attributing the instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z)
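The zero-advantage inefficiency noted in the entry above is easy to reproduce with a toy group-relative advantage computation. This is a generic GRPO-style sketch, not code from the paper: when every rollout in a prompt's group receives the same verifiable reward, the centered advantages are all zero and the group contributes no policy-gradient signal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale verifiable rewards within
    one prompt's rollout group. A group whose rollouts all receive the
    same reward yields all-zero advantages and hence no gradient signal."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([1, 1, 1, 1]))  # [0. 0. 0. 0.]: wasted group
print(group_relative_advantages([1, 0, 1, 0]))  # [ 1. -1.  1. -1.]: informative
```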
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our approach demonstrates significant performance gains across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
- Robust Federated Learning with Confidence-Weighted Filtering and GAN-Based Completion under Noisy and Incomplete Data [0.0]
Federated learning (FL) presents an effective solution for collaborative model training while maintaining data privacy across decentralized client datasets. This study proposes a federated learning methodology that systematically addresses data quality issues, including noise, class imbalance, and missing labels. Our results indicate that this method effectively mitigates common data quality challenges, providing a robust, scalable, and privacy-compliant solution.
arXiv Detail & Related papers (2025-05-14T18:49:18Z)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [66.36912000442608]
NoisyRollout is a simple yet effective data augmentation method. It mixes training trajectories from both clean and moderately distorted images. It achieves state-of-the-art performance among open-source RL-tuned models.
arXiv Detail & Related papers (2025-04-17T16:10:13Z)
- SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard Examples [54.760757107700755]
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. The class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. We propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi).
arXiv Detail & Related papers (2025-01-10T14:35:16Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Stabilizing and Improving Federated Learning with Non-IID Data and Client Dropout [15.569507252445144]
Data heterogeneity induced by label distribution skew has been shown to be a significant obstacle that limits model performance in federated learning.
We propose a simple yet effective framework by introducing a prior-calibrated softmax function for computing the cross-entropy loss (a sketch of this idea follows the entry).
Improved model performance over existing baselines is demonstrated in the presence of non-IID data and client dropout.
arXiv Detail & Related papers (2023-03-11T05:17:59Z)
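The prior-calibrated softmax in the entry above is reminiscent of logit adjustment for class-imbalanced training. Below is a minimal sketch in that spirit, assuming a hypothetical helper name `prior_calibrated_cross_entropy`, a client-side class-prior vector, and a temperature `tau`; none of these details come from the paper itself.

```python
import numpy as np

def prior_calibrated_cross_entropy(logits, labels, class_prior, tau=1.0):
    """Hypothetical sketch in the spirit of logit adjustment: shift each
    logit by tau * log(prior) before the softmax so that classes that are
    over-represented in a client's local data do not dominate the loss."""
    adjusted = logits + tau * np.log(class_prior)      # (N, C) broadcast
    adjusted -= adjusted.max(axis=1, keepdims=True)    # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: 3 classes with a skewed local label distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 3))
labels = rng.integers(0, 3, size=8)
prior = np.array([0.7, 0.2, 0.1])  # client-side class frequencies
print(prior_calibrated_cross_entropy(logits, labels, prior))
```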
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.