BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
- URL: http://arxiv.org/abs/2510.18927v1
- Date: Tue, 21 Oct 2025 12:55:04 GMT
- Title: BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
- Authors: Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang,
- Abstract summary: We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO)<n>BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization.<n>On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
- Score: 69.74252624161652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training.<n>Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models.<n> token-level correction often leads to unstable training dynamics when the degree of off-policyness is large.<n>We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO)
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse.<n>$R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z) - Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models [7.316631310935769]
Vision-Language-Action (VLA) models have shown strong generalization by leveraging large-scale demonstrations.<n>We propose Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective.<n>We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL.
arXiv Detail & Related papers (2025-10-11T03:11:18Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase.<n>We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro.<n>Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
arXiv Detail & Related papers (2025-10-01T09:11:27Z) - GTPO: Trajectory-Based Policy Optimization in Large Language Models [42.60363805227946]
Policy-based optimizations are widely adopted today for the training and alignment of language models.<n>In this paper, we reveal and analyze two major limitations of GRPO.<n>We introduce GTPO, which identifies conflict tokens, tokens appearing in the same position across completions with opposite rewards.
arXiv Detail & Related papers (2025-08-05T08:15:01Z) - Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories.<n>We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z) - BNPO: Beta Normalization Policy Optimization [9.60676665395923]
We propose a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters.<n>We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings.<n> Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks.
arXiv Detail & Related papers (2025-06-03T13:28:57Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.<n>OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.<n>Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.<n>Current approaches typically address this issue through online sampling from the target policy.<n>We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.