Related papers: Non-Asymptotic Global Convergence of PPO-Clip

Non-Asymptotic Global Convergence of PPO-Clip

URL: http://arxiv.org/abs/2512.16565v1
Date: Thu, 18 Dec 2025 14:06:37 GMT
Title: Non-Asymptotic Global Convergence of PPO-Clip
Authors: Yin Liu, Qiming Dai, Junyu Zhang, Zaiwen Wen,
Abstract summary: This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting.<n>We derive a non-uniform Lipschitz smoothness condition and a ojasiewicz inequality for the considered problem.
Score: 23.221917827987625
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). The actor-only variants of Proximal Policy Optimization (PPO) are widely applied for their efficiency. These algorithms incorporate a clipping mechanism to improve stability. Besides, a regularization term, such as the reverse KL-divergence or a more general $f$-divergence, is introduced to prevent policy drift. Despite their empirical success, a rigorous theoretical understanding of the problem and the algorithm's properties is limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting with $f$-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem. Based on these, a non-asymptotic linear convergence rate to the globally optimal policy is established for the forward KL-regularizer. Furthermore, stationary convergence and local linear convergence are derived for the reverse KL-regularizer.

Related papers

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning [49.25750348525603]
BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions into dynamic, probability-aware clipping intervals.<n>We show that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
arXiv Detail & Related papers (2026-03-05T08:03:05Z)
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation [10.35045003737115]
Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ.<n>We propose DR-RPO, a model-free online policy optimization method that learns robust policies with sublinear regret.<n>We show that DR-RPO can achieve suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches.
arXiv Detail & Related papers (2025-10-16T02:56:58Z)
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$chi2$-Preference Optimization ($chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization.<n>$chi$PO implements the principle of pessimism in the face of uncertainty via regularization.<n>$chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
Fast Policy Learning for Linear Quadratic Control with Entropy Regularization [8.670376057309193]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.<n> Assuming access to the exact policy evaluation, both proposed approaches are proven to converge linearly in finding optimal policies of the regularized LQC.
arXiv Detail & Related papers (2023-11-23T19:08:39Z)
Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability [48.62272919754204]
We study generalization in Bayesian RL under the probably approximately correct (PAC) framework. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense.
arXiv Detail & Related papers (2021-09-24T07:48:34Z)
Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL. We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints. This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy [119.12515258771302]
We show that a variant of PPOO equipped with over-parametrization converges to globally optimal networks. The key to our analysis is the iterate of infinite gradient under a notion of one-dimensional monotonicity, where the gradient and are instant by networks.
arXiv Detail & Related papers (2019-06-25T03:20:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.