Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
- URL: http://arxiv.org/abs/2602.08499v1
- Date: Mon, 09 Feb 2026 10:51:58 GMT
- Title: Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
- Authors: Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang,
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models.<n>We propose a unified neural scheduling framework that adaptively selects high-value rollouts throughout training.<n>Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
- Score: 69.74686029941881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback.<n>ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z) - Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [29.56905427210088]
gradient-ARM is a framework that jointly optimize a rubric generator and a judge using reinforcement learning from preference feedback.<n>We show that gradient-ARM achieves state-of-the-art performance among baselines on benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
arXiv Detail & Related papers (2026-02-02T00:50:53Z) - Stable and Efficient Single-Rollout RL for Multimodal Reasoning [66.53652874617217]
$textbfMSSR$ (Multimodal Stabilized Single-Rollout) is a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance.<n>In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps.
arXiv Detail & Related papers (2025-12-20T05:07:53Z) - Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration [33.02780998281276]
Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models.<n>This study investigates how simply leveraging intrinsic data properties, almost free benefit during training, can improve data efficiency for RLVR.
arXiv Detail & Related papers (2025-11-02T04:16:47Z) - Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM)<n>However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing.<n>We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks.<n>Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization.<n>We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z) - Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words.<n>Recently, new efficient -- inverted index-based -- retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models?<n>We show that regularization can be relaxed to produce more effective LSR encoders.
arXiv Detail & Related papers (2025-04-30T09:10:46Z) - Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models.<n>It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive.<n>We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL)
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.