Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
- URL: http://arxiv.org/abs/2602.20722v1
- Date: Tue, 24 Feb 2026 09:35:43 GMT
- Title: Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
- Authors: Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun
- Abstract summary: Batch Adaptation Policy Optimization (BAPO) is an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones. BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully solves 40.7% of the problems that base models consistently fail to solve.
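The abstract gives only the mechanism's outline. As a rough illustration, a buffer that re-evaluates historically difficult samples and reuses high-quality rollouts might look like the sketch below; all names, thresholds, and the mixing rule are assumptions for illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str
    reward: float = 0.0   # verifiable reward from the latest rollout group
    attempts: int = 0     # how many times the sample has been revisited

@dataclass
class AdaptiveBuffer:
    """Keeps hard samples for re-evaluation and good rollouts for reuse."""
    hard: list = field(default_factory=list)
    good: list = field(default_factory=list)

    def record(self, sample, hard_thresh=0.2, good_thresh=0.8):
        sample.attempts += 1
        if sample.reward <= hard_thresh:
            self.hard.append(sample)      # revisit: the model keeps failing here
        elif sample.reward >= good_thresh:
            self.good.append(sample)      # reuse: high-quality off-policy data

    def next_batch(self, fresh, batch_size, reuse_frac=0.5):
        """Mix fresh on-policy samples with recycled buffer samples."""
        pool = self.hard + self.good
        n_reuse = min(int(batch_size * reuse_frac), len(pool))
        recycled = random.sample(pool, n_reuse)
        return recycled + fresh[: batch_size - n_reuse]
```

In this reading, the off-policy component enters through the recycled rollouts; the importance corrections behind the paper's lower-bound guarantee are omitted from the sketch.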
Related papers
- Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training [63.34044358216334]
ACTOR-CURATOR is a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines.
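A minimal sketch of the bandit-curator idea, assuming a UCB1 bandit over difficulty buckets whose reward is the measured policy improvement after training on that bucket (all names are invented; the paper's bandit may differ):

```python
import math

class CurriculumBandit:
    """UCB1 over curriculum arms, rewarded by policy improvement."""
    def __init__(self, arms):
        self.arms = arms                      # e.g. ["easy", "medium", "hard"]
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}
        self.t = 0

    def select(self):
        self.t += 1
        for a in self.arms:                   # play each arm once first
            if self.counts[a] == 0:
                return a
        return max(self.arms, key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, improvement):
        """improvement: change in eval score after training on this arm."""
        self.counts[arm] += 1
        self.values[arm] += (improvement - self.values[arm]) / self.counts[arm]
```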
arXiv Detail & Related papers (2026-02-24T04:19:48Z)
- RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization [40.41228010377401]
We propose Rephrasing Policy Optimization (RePO) to reconcile off-policy knowledge with the stability of on-policy RL. RePO rephrases off-policy knowledge into trajectories that conform to the policy's own stylistic and parametric distribution. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines.
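Reading the summary literally, the rephrasing step might be as simple as asking the current policy to rewrite an off-policy solution in its own words before training on it. The sketch below assumes only a `policy_generate(prompt) -> str` sampling function; everything else is hypothetical.

```python
def rephrase_off_policy(policy_generate, prompt, expert_solution):
    """Turn an off-policy solution into a trajectory in the policy's own style."""
    instruction = (
        f"{prompt}\n\nHere is a correct solution:\n{expert_solution}\n\n"
        "Rewrite this solution in your own words, step by step."
    )
    rephrased = policy_generate(instruction)
    # The rewrite now roughly follows the policy's stylistic and parametric
    # distribution, so it can be trained on with stable on-policy updates.
    return prompt, rephrased
```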
arXiv Detail & Related papers (2026-02-11T13:02:40Z)
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning [49.290631188365786]
Scaf-GRPO is a training framework that intervenes when a model's independent learning has plateaued. It boosts the pass@1 score of the Qwen2.5-Math-7B model by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach.
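As one way to picture "intervening when learning has plateaued": monitor a moving window of pass rates and attach a hint only when progress stalls. The window size, slope test, and prompt format below are all invented for the sketch.

```python
from collections import deque

class PlateauScaffold:
    """Inject scaffolding hints only when the pass-rate trend flattens."""
    def __init__(self, window=50, min_slope=1e-3):
        self.history = deque(maxlen=window)
        self.min_slope = min_slope

    def plateaued(self):
        h = list(self.history)
        if len(h) < self.history.maxlen:
            return False
        mid = len(h) // 2
        # Crude slope: mean of the recent half minus mean of the older half.
        return (sum(h[mid:]) - sum(h[:mid])) / mid < self.min_slope

    def build_prompt(self, question, pass_rate, hint):
        self.history.append(pass_rate)
        if self.plateaued():
            return f"{question}\nHint: {hint}"   # scaffolded attempt
        return question                          # unaided attempt
```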
arXiv Detail & Related papers (2025-10-22T17:41:30Z)
- DARO: Difficulty-Aware Reweighting Policy Optimization [18.07946696398167]
Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for Reinforcement Learning with Verifiable Rewards (RLVR). We provide a unified view, demonstrating that existing methods' reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. We introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state.
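A minimal sketch of difficulty-aware group reweighting. The learnability score p(1-p) (the reward variance of a group with pass rate p, which peaks at 50%) and the softmax rule are assumptions, not the paper's exact objective.

```python
import math

def group_weights(pass_rates, temperature=1.0):
    """Weight each difficulty group by its current learnability.

    Under binary rewards, a group the model solves ~50% of the time carries
    the most gradient signal, so score groups by p * (1 - p).
    """
    learnability = {g: p * (1 - p) for g, p in pass_rates.items()}
    z = sum(math.exp(v / temperature) for v in learnability.values())
    return {g: math.exp(v / temperature) / z for g, v in learnability.items()}

# Example: weights track the model's evolving capabilities.
print(group_weights({"easy": 0.9, "medium": 0.5, "hard": 0.05}))
# -> "medium" dominates now; as "hard" approaches 50%, weight shifts to it.
```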
arXiv Detail & Related papers (2025-10-10T04:57:15Z)
- ExGRPO: Learning to Reason from Experience [82.83309610498446]
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value.
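The summary names rollout correctness and entropy as value indicators. A toy scoring function combining the two is sketched below; the target correctness, the entropy proxy, and the product rule are illustrative assumptions.

```python
def experience_value(correct_frac, token_logprobs, target_correct=0.5):
    """Score a stored rollout group for replay.

    correct_frac: fraction of the group's rollouts verified correct.
    token_logprobs: per-token log-probs of a rollout under the current
        policy, used as a crude (inverse) entropy proxy.
    """
    # Prefer groups near the target correctness: neither trivial nor hopeless.
    correctness_score = 1.0 - abs(correct_frac - target_correct) / target_correct

    # Lower mean negative log-prob ~ lower entropy ~ more confident rollout.
    avg_nll = -sum(token_logprobs) / max(len(token_logprobs), 1)
    entropy_score = 1.0 / (1.0 + avg_nll)

    return correctness_score * entropy_score
```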
arXiv Detail & Related papers (2025-10-02T17:31:30Z)
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs). This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
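One plausible reading of "adaptive prompt refinement" is a ladder of increasingly explicit guidance, stepped up only on repeated failure. The ladder below is an invented illustration, not the paper's mechanism.

```python
def refine_prompt(question, solution_steps, fail_streak):
    """Calibrate difficulty by revealing more ground-truth steps on failure.

    fail_streak: consecutive all-incorrect rollout groups on this question.
    solution_steps: ordered reference steps available as partial hints.
    """
    n_hints = min(fail_streak, len(solution_steps))
    if n_hints == 0:
        return question                      # full difficulty, no guidance
    shown = "\n".join(solution_steps[:n_hints])
    return f"{question}\nStart from these steps:\n{shown}"
```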
arXiv Detail & Related papers (2025-07-14T08:10:00Z)
- Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
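One way "learning to perceive while learning to reason" can be expressed is an auxiliary term that rewards the model for actually depending on the image, e.g. the KL divergence between its answer distributions with and without the visual input. Whether this matches PAPO's exact objective is an assumption of the sketch.

```python
import math

def perception_term(logits_with_image, logits_masked):
    """KL between answer distributions with vs. without the image.

    If masking the image barely changes the answer distribution, the model
    is ignoring its visual input; maximizing this term encourages perception.
    """
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(logits_with_image)
    q = softmax(logits_masked)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Sketch of a combined objective: loss = pg_loss - gamma * perception_term(...)
```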
arXiv Detail & Related papers (2025-07-08T23:22:34Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
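For a REINFORCE-style estimator, the textbook variance-minimizing baseline weights each trajectory's reward by its squared gradient norm; a cheap practical proxy is to weight by sequence length. The sketch below uses that proxy; whether it matches the paper's exact derivation is an assumption.

```python
def centered_advantages(rewards, lengths):
    """Length-weighted mean reward as a variance-reducing baseline."""
    baseline = sum(l * r for l, r in zip(lengths, rewards)) / sum(lengths)
    return [r - baseline for r in rewards]

# Example: longer rollouts pull the baseline toward their rewards.
print(centered_advantages([1.0, 0.0, 1.0], lengths=[120, 40, 80]))
# -> [~0.17, ~-0.83, ~0.17]
```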
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
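A toy version of instance-level reweighting from loss values; the softmax rule is an assumption (the paper derives its own strategies), but it shows the shape of the idea: near-zero-loss instances, which are already learned or redundant, get little weight.

```python
import math

def instance_weights(losses, temperature=1.0):
    """Softmax over per-instance losses: upweight informative examples."""
    m = max(losses)
    exps = [math.exp((l - m) / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

batch_losses = [0.05, 1.2, 2.4, 0.6]
print(instance_weights(batch_losses))   # the high-loss instances dominate
```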
arXiv Detail & Related papers (2025-02-10T17:57:15Z)