BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
- URL: http://arxiv.org/abs/2603.04918v1
- Date: Thu, 05 Mar 2026 08:03:05 GMT
- Title: BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
- Authors: Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
- Abstract summary: BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions into dynamic, probability-aware clipping intervals. We show that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
- Score: 49.25750348525603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
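To make the core idea concrete, here is a minimal PyTorch sketch of a probability-aware clipping band, based only on the abstract: low-probability tokens get a wider upward margin than PPO's fixed interval. The band rule shown (a 1/sqrt(p) margin) is an illustrative stand-in, not the closed-form solution BandPO derives from its f-divergence projection.

```python
import torch

def band_bounds(old_probs: torch.Tensor, delta: float = 0.05):
    # Hypothetical probability-aware band: the margin grows as the
    # old-policy probability shrinks, so high-advantage tail tokens are
    # not capped as hard as under fixed clipping. The 1/sqrt(p) shape is
    # an assumption for illustration, NOT BandPO's derived closed form.
    margin = delta / old_probs.clamp_min(1e-8).sqrt()
    return (1.0 - margin).clamp_min(0.0), 1.0 + margin

def banded_surrogate(logp_new, logp_old, advantages, delta=0.05):
    # Same pessimistic min as PPO-Clip, but with per-token bounds
    # instead of a fixed [1 - eps, 1 + eps] interval.
    ratio = (logp_new - logp_old).exp()
    lo, hi = band_bounds(logp_old.exp(), delta)
    clipped = torch.clamp(ratio, min=lo, max=hi)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```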
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
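The summary gives no implementation details, but pruning with unbiased gradient estimation is commonly achieved via inverse-propensity (Horvitz-Thompson) weighting; a generic sketch of that idea, not DPPO's actual estimator:

```python
import torch

def pruned_unbiased_loss(per_sample_losses, keep_probs):
    # Keep sample i with probability p_i and reweight survivors by 1/p_i,
    # so the expected gradient equals the full-batch gradient. In a real
    # system the dropped rollouts would never be scored at all; here
    # every loss is computed only to keep the sketch self-contained.
    keep_mask = torch.bernoulli(keep_probs)
    weights = (keep_mask / keep_probs.clamp_min(1e-8)).detach()
    return (weights * per_sample_losses).mean()
```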
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
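The summary does not name the divergence or the constraint form; one common way to substitute clipping with a divergence constraint is a per-token penalty, sketched here with the k3 KL estimator as one plausible choice:

```python
import torch

def divergence_penalty_surrogate(logp_new, logp_old, advantages, beta=0.1):
    # Penalize an estimate of KL(pi_old || pi_new) instead of hard-clipping
    # the ratio. `beta` and the k3 estimator are assumptions for this
    # sketch; the paper's exact constraint is not given in the summary.
    ratio = (logp_new - logp_old).exp()
    kl_k3 = ratio - 1.0 - (logp_new - logp_old)  # nonnegative k3 estimator
    return (ratio * advantages - beta * kl_k3).mean()
```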
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning [30.908304728142983]
We propose Query-Adaptive Trust-Region policy optimization (QUATRO). QUATRO directly enforces trust-region constraints through a principled optimization. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable behavior under increased policy staleness.
arXiv Detail & Related papers (2026-02-04T14:51:04Z) - Clipping-Free Policy Optimization for Large Language Models [30.663054788473598]
Reinforcement learning has become central to post-training large language models, but dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale. We propose Clipping-Free Policy Optimization, which replaces clipping with a convex penalty derived from Total Variation divergence constraints.
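A sample-based sketch of such a penalty, assuming the standard estimator TV(pi_old, pi_new) ~= 0.5 * E_old[|ratio - 1|]; the coefficient and exact form are not given in the summary:

```python
import torch

def tv_penalty_surrogate(logp_new, logp_old, advantages, lam=1.0):
    # Convex, clipping-free penalty on an estimated total-variation
    # divergence. |ratio - 1| is convex in the ratio, matching the
    # abstract's "convex penalty" claim; `lam` is an assumed coefficient.
    ratio = (logp_new - logp_old).exp()
    tv_est = 0.5 * (ratio - 1.0).abs()
    return (ratio * advantages - lam * tv_est).mean()
```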
arXiv Detail & Related papers (2026-01-30T10:32:37Z) - Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts. We analyze how the sampling policy's coverage evolves throughout on-policy training. We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z) - ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization [6.716883192613149]
We propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation.
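One way "aligning constraints with signal quality" could look in code, using within-group reward spread as a toy outcome-quality proxy; ETR's actual quality measure and schedule are not given in the summary:

```python
import torch

def elastic_eps(group_rewards, eps_min=0.1, eps_max=0.3):
    # Toy rule: widen the trust region when rollout rewards within a
    # group disagree (informative outcome signal) and shrink it when they
    # are nearly constant. The tanh schedule and bounds are assumptions.
    quality = torch.tanh(group_rewards.std())  # maps spread into [0, 1)
    return eps_min + (eps_max - eps_min) * quality.item()
```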
arXiv Detail & Related papers (2026-01-07T09:19:53Z) - Non-Asymptotic Global Convergence of PPO-Clip [23.221917827987625]
This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem.
arXiv Detail & Related papers (2025-12-18T14:06:37Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
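A hedged sketch of what a $\chi^2$-regularized, DPO-style preference loss can look like: the usual log density-ratio link gains a linear term in the ratio, which blows up off the reference policy's support and thereby implements pessimism. The exact link and scaling here are assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def chi2_style_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                               beta=0.1):
    # DPO's link is log(pi/pi_ref); adding the raw ratio gives a
    # chi^2-flavored link that penalizes actions far outside the
    # reference support far more aggressively (pessimism via
    # regularization). Link form and `beta` are assumptions.
    log_ratio_w = logp_w - ref_logp_w
    log_ratio_l = logp_l - ref_logp_l
    link_w = log_ratio_w.exp() + log_ratio_w
    link_l = log_ratio_l.exp() + log_ratio_l
    return -F.logsigmoid(beta * (link_w - link_l)).mean()
```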
arXiv Detail & Related papers (2024-07-18T11:08:40Z)