Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
- URL: http://arxiv.org/abs/2602.06107v1
- Date: Thu, 05 Feb 2026 18:57:01 GMT
- Title: Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
- Authors: Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen,
- Abstract summary: Jackpot is a framework that integrates a unified training objective that jointly updates the policy and rollout models. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget.
- Score: 31.843129392507716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the cost. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps at a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
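As a rough, illustrative sketch of the budgeted acceptance rule described in the abstract (not the authors' implementation): a rollout drawn from the cheap model $q$ is accepted with probability $\min(1, c \cdot p(x)/q(x))$, where $p$ is the current policy and the scale $c$ is tuned so the batch-level acceptance rate matches the budget. The sequence-level log-probabilities, the function name, and the bisection scheme below are assumptions made for the example.

```python
import numpy as np

def obrs_accept_probs(logp_policy, logp_rollout, budget):
    """Acceptance probabilities a(x) = min(1, c * p(x)/q(x)), with c chosen
    by geometric bisection so the mean acceptance over the batch ~= budget.
    Illustrative only; not the Jackpot implementation."""
    ratios = np.exp(np.asarray(logp_policy) - np.asarray(logp_rollout))  # p/q

    def mean_accept(c):
        return np.minimum(1.0, c * ratios).mean()

    lo, hi = 1e-8, 1e8
    for _ in range(200):                # mean_accept is monotone in c
        mid = np.sqrt(lo * hi)
        if mean_accept(mid) < budget:
            lo = mid                    # need a larger scale
        else:
            hi = mid
    return np.minimum(1.0, hi * ratios)

# Toy usage: sequence-level log-probs under the policy (target) and the
# cheaper rollout model (proposal); aim to keep roughly half the rollouts.
rng = np.random.default_rng(0)
logp_rollout = rng.normal(-20.0, 2.0, size=64)
logp_policy = logp_rollout + rng.normal(0.0, 0.5, size=64)   # mild mismatch
a = obrs_accept_probs(logp_policy, logp_rollout, budget=0.5)
accepted = rng.random(64) < a
print(f"accepted {accepted.sum()} / 64 rollouts")
```

Accepted rollouts then follow $\tilde q(x) \propto q(x)\,\min(1, c\,p(x)/q(x))$, which is the sense in which the acceptance step pulls the rollout distribution toward the policy while the budget caps how many samples are discarded.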
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training performance and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs [31.371566320424552]
CoBA-RL is a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements.
arXiv Detail & Related papers (2026-02-03T03:14:36Z)
- Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards [26.5011687714416]
Existing group-based policy optimization methods allocate a fixed number of rollouts for all training prompts. This uniform allocation treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update.
arXiv Detail & Related papers (2026-02-02T03:50:01Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization [44.14678335188207]
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs). Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method.
arXiv Detail & Related papers (2025-10-09T13:59:50Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
- AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings. (An illustrative sketch of the reward-margin idea follows after this list.)
arXiv Detail & Related papers (2024-10-14T04:29:57Z)
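For context on the adaptive reward margin mentioned in the AlphaDPO entry above, here is a generic DPO loss with an optional per-pair margin term, sketched in PyTorch. How $\alpha$-DPO actually computes its margin from the policy and reference models is not reproduced here; the margin passed in, the function name, and the numbers are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_margin(policy_logp_w, policy_logp_l,
                         ref_logp_w, ref_logp_l,
                         beta=0.1, margin=None):
    """Standard DPO loss plus an optional per-pair reward margin.
    The adaptive margin rule of alpha-DPO is NOT implemented here;
    `margin` is a placeholder tensor supplied by the caller."""
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    logits = chosen - rejected
    if margin is not None:
        logits = logits - margin        # larger margin -> stricter preference
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up sequence log-probs for a batch of 4 preference pairs.
plw = torch.tensor([-30.0, -25.0, -40.0, -22.0])   # policy, chosen
pll = torch.tensor([-32.0, -24.0, -45.0, -30.0])   # policy, rejected
rlw = torch.tensor([-31.0, -26.0, -41.0, -23.0])   # reference, chosen
rll = torch.tensor([-31.0, -25.0, -44.0, -29.0])   # reference, rejected
print(dpo_loss_with_margin(plw, pll, rlw, rll, margin=torch.full((4,), 0.05)))
```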