CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
- URL: http://arxiv.org/abs/2505.12504v1
- Date: Sun, 18 May 2025 17:44:53 GMT
- Title: CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
- Authors: Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang,
- Abstract summary: Rule-based reinforcement learning (RL) has improved the reasoning capability of language models (LMs) with rule-based rewards. Existing RL methods often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. We propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs.
- Score: 11.295986905174635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
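The abstract names two stabilizing components: a clip applied to the logarithm of the probability ratio and a KL-divergence "policy drift" penalty. The snippet below is a minimal, illustrative sketch of how such a loss could be assembled; the function name, clip range eps, drift weight beta, and the particular KL estimator are assumptions made for illustration, not the paper's actual objective (the authors' implementation is in the released code linked above).

```python
import torch

def cpgd_style_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.1):
    """Illustrative loss combining a clip on the log-ratio with a KL-style
    policy-drift penalty.

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs under the rollout (old) policy
    advantages: advantages computed from rule-based rewards
    """
    log_ratio = logp_new - logp_old.detach()

    # Clip the *logarithm* of the ratio (rather than the ratio itself)
    # to bound the size of each policy update.
    clipped_log_ratio = torch.clamp(log_ratio, -eps, eps)
    pg_term = torch.exp(clipped_log_ratio) * advantages

    # Policy drift: a simple KL estimate between the new and old policies
    # (the 0.5 * log_ratio^2 approximation, used here as a stand-in).
    drift = 0.5 * log_ratio.pow(2)

    # Maximize the clipped policy-gradient term while penalizing drift.
    return -(pg_term - beta * drift).mean()
```

In a training loop this would be evaluated per token and averaged over the batch; beta controls how strongly updates are pulled back toward the rollout policy.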
Related papers
- EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z) - Accelerating Residual Reinforcement Learning with Uncertainty Estimation [20.516264459225734]
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for a broader class of base policies.
arXiv Detail & Related papers (2025-06-21T03:18:01Z) - On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning [50.856589224454055]
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). We propose regularized policy gradient (RPG), a framework for deriving and analyzing KL-regularized policy gradient methods in the online reinforcement learning setting (a generic form of such an objective is sketched after this list). RPG shows improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO.
arXiv Detail & Related papers (2025-05-23T06:01:21Z) - Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies. We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z) - Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning [13.163511229897667]
In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. We propose Diffusion Actor-Critic (DAC), which formulates Kullback-Leibler (KL) constrained policy iteration as a diffusion noise regression problem. Our approach is evaluated on D4RL benchmarks and outperforms the state of the art in nearly all environments.
arXiv Detail & Related papers (2024-05-31T00:41:04Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z) - Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z) - Robust Deep Reinforcement Learning against Adversarial Perturbations on
State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.