A dynamical clipping approach with task feedback for Proximal Policy
Optimization
- URL: http://arxiv.org/abs/2312.07624v2
- Date: Fri, 8 Mar 2024 02:37:16 GMT
- Title: A dynamical clipping approach with task feedback for Proximal Policy
Optimization
- Authors: Ziqi Zhang, Jingzehua Xu, Zifeng Zhuang, Jinxin Liu, Donglin Wang,
Shuai Zhang
- Abstract summary: There is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process.
Previous research suggests that a fixed clipping bound limits the agent's exploration.
We introduce a new algorithm named Preference based Proximal Policy Optimization (Pb-PPO).
- Score: 31.823327359782162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proximal Policy Optimization (PPO) has been broadly applied to various
domains, including Large Language Model (LLM) optimization and robotics learning.
However, PPO is limited by its fixed setting of the clipping bound: there is no
theoretical proof that a single clipping bound remains optimal throughout the
entire training process, i.e., that truncating the ratio of the new and old
policies with one fixed bound ensures stable training and the best training
performance. Moreover, previous research suggests that a fixed clipping bound
limits the agent's exploration. A dynamical clipping bound that adapts during
training is therefore highly desirable. Unlike previous clipping approaches, we
treat increasing the maximum cumulative return of the reinforcement learning (RL)
task as the task's preference, and propose a bi-level proximal policy
optimization paradigm that not only optimizes the policy but also dynamically
adjusts the clipping bound to reflect this preference, further improving the
training outcomes and stability of PPO. Based on this paradigm, we introduce a
new algorithm named Preference based Proximal Policy Optimization (Pb-PPO).
Pb-PPO utilizes a multi-armed bandit algorithm to reflect the RL preference (we
also validate that such an approach can be used to reflect human preference),
recommending the optimal clipping bound for PPO at each epoch and thereby
achieving more stable and better training outcomes.
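
The bi-level loop described above is straightforward to sketch: the inner level runs ordinary clipped-PPO updates, while an outer multi-armed bandit treats a handful of candidate clipping bounds as arms and uses the return obtained after each epoch as bandit feedback. The Python sketch below is an illustration under stated assumptions (UCB1 as the bandit, three illustrative candidate bounds, and caller-supplied ppo_epoch / evaluate hooks), not the authors' Pb-PPO implementation.

```python
import math

def ucb_select(counts, values):
    """Pick the candidate clipping bound with the highest UCB1 score."""
    for arm, c in enumerate(counts):
        if c == 0:                          # try every bound at least once
            return arm
    t = sum(counts)
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def ppo_clip_loss(ratio, advantage, eps):
    # Standard per-sample PPO clipped surrogate: -min(r*A, clip(r, 1-eps, 1+eps)*A)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)

def pb_ppo_train(ppo_epoch, evaluate, num_epochs, candidate_eps=(0.1, 0.2, 0.3)):
    """Outer level: a bandit recommends the clipping bound for each PPO epoch.

    ppo_epoch(eps) -- runs one ordinary clipped-PPO epoch with bound eps (assumed hook)
    evaluate()     -- returns the average episodic return (task feedback, assumed hook)
    """
    counts = [0] * len(candidate_eps)
    values = [0.0] * len(candidate_eps)     # running mean return per candidate bound
    for _ in range(num_epochs):
        arm = ucb_select(counts, values)
        ppo_epoch(candidate_eps[arm])       # inner level: usual clipped-PPO update
        ret = evaluate()                    # cumulative return acts as the "preference"
        counts[arm] += 1
        values[arm] += (ret - values[arm]) / counts[arm]
    return candidate_eps[max(range(len(values)), key=values.__getitem__)]
```

In practice the raw returns would likely need normalization before being used as bandit rewards; that detail is omitted here.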
Related papers
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.
$A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.
It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [68.26281707780761]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models.
We show that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO.
arXiv Detail & Related papers (2025-03-28T11:30:05Z)
- Beyond the Boundaries of Proximal Policy Optimization [17.577317574595206]
This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors.
We propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer.
Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments.
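Concretely, the decomposition can be sketched as follows: run the usual inner PPO epochs, read off the resulting parameter change as the update vector, rewind, and apply that vector through whatever outer gradient-based optimizer one prefers. The PyTorch-style sketch below is an illustration under assumed helper names (inner_ppo_epochs running standard clipped-PPO epochs in place), not the paper's code.

```python
# Illustrative sketch (not the paper's code): apply PPO's inner-loop update
# vector with an arbitrary outer gradient-based optimizer.
import torch

def outer_ppo_step(policy, inner_ppo_epochs, outer_opt):
    """inner_ppo_epochs(policy) is assumed to run standard clipped-PPO epochs in place."""
    old_params = [p.detach().clone() for p in policy.parameters()]
    inner_ppo_epochs(policy)                   # inner loop: estimate the update vector
    with torch.no_grad():
        for p, old in zip(policy.parameters(), old_params):
            delta = p - old                    # the inner-loop "update vector"
            p.copy_(old)                       # rewind to the pre-update parameters
            p.grad = -delta                    # expose the vector as a gradient
    outer_opt.step()                           # outer loop: any gradient-based optimizer
    outer_opt.zero_grad()

# e.g. outer_opt = torch.optim.SGD(policy.parameters(), lr=1.0, momentum=0.9);
# with lr=1.0 and momentum=0.0 this reduces to vanilla PPO.
```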
arXiv Detail & Related papers (2024-11-01T15:29:10Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model policies in structured action spaces as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
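For reference, the ratio-clipping surrogate that this line of work revisits is the standard PPO-clip objective, with ratio $r_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\theta_\mathrm{old}}(a_t\mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$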
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified by a theoretical proof to date.
This is the first work to prove global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z)
- Proximal Policy Optimization Smoothed Algorithm [0.0]
We present a PPO variant named Proximal Policy Optimization Smoothed Algorithm (PPOS).
Its critical improvement is the use of a functional clipping method instead of a flat clipping method.
We show that it outperforms the latest PPO variants on both performance and stability in challenging continuous control tasks.
arXiv Detail & Related papers (2020-12-04T07:43:50Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.