On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
- URL: http://arxiv.org/abs/2505.17508v2
- Date: Sun, 28 Sep 2025 23:42:53 GMT
- Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
- Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
- Abstract summary: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). The Regularized Policy Gradient (RPG) view shows that the widely-used $k_3$ penalty is exactly the unnormalized KL. RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
- Score: 59.11784194183928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.
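To make the design surface concrete, here is a minimal PyTorch sketch of two ingredients the abstract names: the standard $k_1/k_2/k_3$ Monte Carlo KL estimators and a generic truncated importance weight of the kind used in truncated importance sampling. Function names, the clip constant, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def kl_estimators(logp_theta: torch.Tensor, logp_ref: torch.Tensor):
    """Per-token Monte Carlo estimators of KL(pi_theta || pi_ref) from samples
    drawn under pi_theta, in the standard k1/k2/k3 forms.
    logp_theta, logp_ref: log-probs of the sampled tokens under the current
    policy and the reference policy."""
    log_ratio = logp_ref - logp_theta      # log(pi_ref / pi_theta)
    ratio = log_ratio.exp()
    k1 = -log_ratio                        # unbiased, high variance, can go negative
    k2 = 0.5 * log_ratio ** 2              # biased, low variance
    k3 = ratio - 1.0 - log_ratio           # always non-negative; the "k3 penalty"
    return k1, k2, k3

def truncated_is_weight(logp_theta: torch.Tensor, logp_behavior: torch.Tensor,
                        clip_c: float = 2.0) -> torch.Tensor:
    """Truncated importance weight min(pi_theta / pi_mu, c), a generic form of
    truncated importance sampling; clip_c = 2.0 is an illustrative value."""
    return torch.clamp((logp_theta - logp_behavior).exp(), max=clip_c)
```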
Related papers
- Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training [33.61029387987583]
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL). We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency.
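The fragment below is one plausible reading of the "mean reward" approximation described above, assuming a softmax PMD update whose log-partition term $\log \mathbb{E}[\exp(\eta r)]$ is replaced by $\eta\,\mathbb{E}[r]$, which amounts to subtracting a mean-reward baseline. It is a hedged sketch, not the paper's exact algorithm; the step size and grouping are illustrative.

```python
import torch

def pmd_mean_tilt(rewards: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    """One plausible reading (not the paper's exact algorithm): replace the
    log-partition log E[exp(eta * r)] with eta * mean(r), so each sample's
    exponential tilt becomes exp(eta * (r - mean(r))) -- a mean-reward baseline.
    'rewards' holds rewards of samples drawn from the sampling policy."""
    return torch.exp(eta * (rewards - rewards.mean()))
```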
arXiv Detail & Related papers (2026-02-05T17:44:28Z) - Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients [9.932325888357488]
Group Relative Policy Optimization (GRPO) is the de facto standard among Reinforcement Learning (RL) algorithms. We provide an explanation through the lens of local curvature of the sequence-level policy gradient: standard deviation normalization implements an adaptive gradient. We show that under mild conditions, GRPO enjoys a strictly improved convergence rate over unnormalized REINFORCE, with gains characterized by the average within-prompt reward standard deviation.
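For reference, a minimal sketch of the group-wise standard-deviation normalization the summary refers to, written as the group-relative advantage is commonly implemented in GRPO; the epsilon and tensor layout are illustrative.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as commonly implemented in GRPO: per prompt,
    center the G sampled rewards and divide by their standard deviation.
    'rewards' has shape (num_prompts, G); eps guards zero-variance groups."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```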
arXiv Detail & Related papers (2026-01-30T16:23:43Z) - Non-Asymptotic Global Convergence of PPO-Clip [23.221917827987625]
This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem.
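For context, a minimal sketch of the standard PPO-Clip surrogate that this analysis concerns; the paper studies a deterministic actor-only variant, which is not reproduced here, and eps = 0.2 is an illustrative value.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO-Clip surrogate: take the pessimistic minimum of the raw and
    clipped importance-ratio terms so a single update cannot move the policy
    too far from the policy that collected the data."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```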
arXiv Detail & Related papers (2025-12-18T14:06:37Z) - Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games [53.447182734351]
We develop and analyze algorithms that provably achieve improved sample efficiency under Reverse Kullback-Leibler (KL) regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret.
arXiv Detail & Related papers (2025-10-15T01:00:54Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models [11.295986905174635]
Rule-based reinforcement learning (RL) has improved the reasoning capability of language models (LMs) with rule-based rewards. Existing RL methods often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. We propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs.
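Purely as an illustration of the ingredients named above (clipping plus a drift penalty), here is one plausible combination: a clipped policy-gradient term with a $k_3$-style KL penalty toward the sampling policy. This is a hedged sketch, not CPGD's actual objective; eps and beta are placeholder values.

```python
import torch

def clipped_pg_with_drift(logp_new: torch.Tensor, logp_old: torch.Tensor,
                          advantages: torch.Tensor,
                          eps: float = 0.2, beta: float = 0.1) -> torch.Tensor:
    """Illustrative only, not CPGD's exact loss: PPO-style clipped term plus a
    k3-style estimate of the KL between the sampling and updated policies
    (samples come from the sampling policy)."""
    log_ratio = logp_new - logp_old
    ratio = log_ratio.exp()
    pg = torch.min(ratio * advantages,
                   torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    drift = ratio - 1.0 - log_ratio   # non-negative KL-style drift penalty
    return -(pg - beta * drift).mean()
```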
arXiv Detail & Related papers (2025-05-18T17:44:53Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models. Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. We propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies. We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
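As a rough illustration of the core weight-space operation such merging relies on, the sketch below linearly interpolates two policies' parameter dictionaries; WARP's actual three-stage schedule and coefficients are not reproduced here.

```python
import torch

def interpolate_weights(state_dict_a: dict, state_dict_b: dict, lam: float = 0.5) -> dict:
    """Linear interpolation of two policies in weight space, the basic operation
    behind weight-averaged policies; lam = 0.5 is an illustrative mixing value."""
    return {k: torch.lerp(state_dict_a[k], state_dict_b[k], lam)
            for k in state_dict_a}
```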
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Policy Gradient for Reinforcement Learning with General Utilities [50.65940899590487]
In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards.
Many supervised and unsupervised RL problems are not covered in the Linear RL framework.
We derive the policy gradient theorem for RL with general utilities.
arXiv Detail & Related papers (2022-10-03T14:57:46Z) - Global Convergence Using Policy Gradient Methods for Model-free
Markovian Jump Linear Quadratic Control [8.98732207994362]
We study the global convergence of gradient-based policy optimization methods for control of discrete-time and model-free Markovian jump linear systems.
We show global convergence of the policy using gradient descent and natural policy gradient methods.
arXiv Detail & Related papers (2021-11-30T09:26:26Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement
Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z) - Structured Policy Iteration for Linear Quadratic Regulator [40.52288246664592]
We introduce Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy.
Such a structured policy with (block) sparsity or low-rank can have significant advantages over the standard LQR policy.
In both the known-model and model-free settings, we establish convergence under a proper choice of parameters.
arXiv Detail & Related papers (2020-07-13T06:03:15Z) - When Will Generative Adversarial Imitation Learning Algorithms Attain
Global Convergence [56.40794592158596]
We study generative adversarial imitation learning (GAIL) under general MDP and for nonlinear reward function classes.
This is the first systematic theoretical study of GAIL for global convergence.
arXiv Detail & Related papers (2020-06-24T06:24:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.