Proximal Policy Gradient: PPO with Policy Gradient
- URL: http://arxiv.org/abs/2010.09933v1
- Date: Tue, 20 Oct 2020 00:14:57 GMT
- Title: Proximal Policy Gradient: PPO with Policy Gradient
- Authors: Ju-Seung Byun, Byungmoon Kim, Huamin Wang
- Abstract summary: We propose a new algorithm PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization). The performance of PPG is comparable to that of PPO, and the entropy of PPG decays more slowly than that of PPO.
- Score: 13.571988925615486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new algorithm PPG (Proximal Policy Gradient),
which is close to both VPG (vanilla policy gradient) and PPO (proximal policy
optimization). The PPG objective is a partial variation of the VPG objective
and the gradient of the PPG objective is exactly the same as the gradient of the
VPG objective. To increase the number of policy update iterations, we introduce
the advantage-policy plane and design a new clipping strategy. We perform
experiments in OpenAI Gym and Bullet robotics environments for ten random
seeds. The performance of PPG is comparable to that of PPO, and the entropy of PPG decays
more slowly than that of PPO. Thus we show that performance similar to PPO can be obtained
by using the gradient formula from the original policy gradient theorem.
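To make the relationship concrete, below is a minimal sketch (assuming PyTorch) of the two objectives the abstract positions PPG between: the VPG surrogate, whose gradient is the classic policy gradient, and PPO's ratio-clipped surrogate. The paper's own PPG objective and its advantage-policy-plane clipping are not reproduced here; the function names and the clipping constant are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: VPG surrogate vs. PPO's clipped surrogate (PyTorch assumed).
# The PPG objective of the paper is NOT implemented here; this only illustrates
# the two objectives it is said to sit between.
import torch


def vpg_loss(log_probs, advantages):
    """VPG surrogate: minimizing it follows the classic policy gradient E[A * grad log pi]."""
    return -(log_probs * advantages.detach()).mean()


def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO surrogate: clips the probability ratio pi_new / pi_old to [1 - eps, 1 + eps]."""
    advantages = advantages.detach()
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Per the abstract, PPG keeps exactly the VPG gradient while using its own clipping strategy (on the advantage-policy plane) to allow more policy update iterations per batch of samples, which is the role the ratio clip plays in PPO.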
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning [12.987019067098412]
We adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in Reinforcement Learning (RL)
We prove that APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially growing step sizes.
arXiv Detail & Related papers (2023-10-18T11:33:22Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping tricks, demonstrating that they primarily serve to offset the heavy-tailedness of the gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z)
- Phasic Policy Gradient [24.966649684989367]
In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function.
We introduce Phasic Policy Gradient, a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases.
PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features.
arXiv Detail & Related papers (2020-09-09T16:52:53Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
- Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG)
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function (a minimal two-point estimator sketch follows this list).
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
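As referenced in the Zeroth-order Deterministic Policy Gradient entry above, the following is a minimal sketch of a two-point (zeroth-order) gradient estimate, the generic building block such methods rely on. How ZDPG applies it to the $Q$-function and the deterministic policy gradient is not reproduced here; objective, mu, and num_directions are illustrative names, not taken from the paper.

```python
# Minimal sketch of a two-point zeroth-order gradient estimate (NumPy assumed).
# This is the generic estimator, not ZDPG itself.
import numpy as np


def two_point_gradient(objective, theta, mu=1e-2, num_directions=16, rng=None):
    """Estimate the gradient of a scalar objective at theta from pairs of perturbed evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_directions):
        u = rng.standard_normal(theta.shape)  # random search direction
        delta = objective(theta + mu * u) - objective(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u  # finite-difference slope along u
    return grad / num_directions
```

For a simple scalar objective, e.g. objective = lambda th: -np.sum((th - 1.0) ** 2), the estimate approaches the true gradient as mu shrinks and num_directions grows.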
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.