On Proximal Policy Optimization's Heavy-tailed Gradients
- URL: http://arxiv.org/abs/2102.10264v1
- Date: Sat, 20 Feb 2021 05:51:28 GMT
- Title: On Proximal Policy Optimization's Heavy-tailed Gradients
- Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico
Kolter, Sivaraman Balakrishnan, Zachary C. Lipton, Ruslan Salakhutdinov and
Pradeep Ravikumar
- Abstract summary: We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
- Score: 150.08522793940708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern policy gradient algorithms, notably Proximal Policy Optimization
(PPO), rely on an arsenal of heuristics, including loss clipping and gradient
clipping, to ensure successful learning. These heuristics are reminiscent of
techniques from robust statistics, commonly used for estimation in outlier-rich
("heavy-tailed") regimes. In this paper, we present a detailed empirical study
to characterize the heavy-tailed nature of the gradients of the PPO surrogate
reward function. We demonstrate that the gradients, especially for the actor
network, exhibit pronounced heavy-tailedness and that it increases as the
agent's policy diverges from the behavioral policy (i.e., as the agent goes
further off policy). Further examination implicates the likelihood ratios and
advantages in the surrogate reward as the main sources of the observed
heavy-tailedness. We then highlight issues arising due to the heavy-tailed
nature of the gradients. In this light, we study the effects of the standard
PPO clipping heuristics, demonstrating that these tricks primarily serve to
offset heavy-tailedness in gradients. Thus motivated, we propose incorporating
GMOM, a high-dimensional robust estimator, into PPO as a substitute for three
clipping tricks. Despite requiring less hyperparameter tuning, our method
matches the performance of PPO (with all heuristics enabled) on a battery of
MuJoCo continuous control tasks.
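To make the heuristics discussed above concrete, the sketch below illustrates two of PPO's clipping tricks (likelihood-ratio "loss" clipping and gradient-norm clipping) together with a simplified median-of-means aggregation of per-sample gradients in the spirit of the proposed GMOM estimator. This is a minimal PyTorch sketch under assumed interfaces (`policy.log_prob`, the batch arguments, and the Weiszfeld-based aggregator are illustrative choices, not the authors' implementation).

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # Likelihood-ratio ("loss") clipping: the standard PPO surrogate.
    ratio = torch.exp(new_log_probs - old_log_probs)           # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # pessimistic lower bound

def actor_update(policy, optimizer, obs, actions, old_log_probs, advantages,
                 max_grad_norm=0.5):
    # One actor step combining loss clipping with gradient-norm clipping.
    loss = ppo_clipped_loss(policy.log_prob(obs, actions),     # hypothetical API
                            old_log_probs, advantages)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()

def median_of_means(per_sample_grads, num_blocks=10, iters=20):
    # Robust aggregation of per-sample gradient vectors (shape [n, d]):
    # average within blocks, then take the geometric median of the block
    # means via Weiszfeld iterations. A simplified stand-in for, not a
    # reproduction of, the paper's GMOM estimator.
    blocks = torch.chunk(per_sample_grads, num_blocks, dim=0)
    means = torch.stack([b.mean(dim=0) for b in blocks])       # [num_blocks, d]
    z = means.mean(dim=0)                                      # initialize at the plain mean
    for _ in range(iters):
        dists = torch.norm(means - z, dim=1).clamp_min(1e-8)
        weights = 1.0 / dists
        z = (weights[:, None] * means).sum(dim=0) / weights.sum()
    return z
```

The intended contrast: the clipping heuristics bound the influence of outlier samples indirectly, whereas a robust aggregator such as median-of-means (or the paper's GMOM) replaces the batch-mean gradient itself, so an update would step along `median_of_means(per_sample_grads)` instead of the ordinary mean.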
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) that this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks [0.46040036610482665]
We consider the impact of regularization on the diversity of actions taken by policies generated from reinforcement learning agents trained using a policy gradient.
Numerical evidence is given to show that policy regularization increases performance without losing accuracy.
arXiv Detail & Related papers (2023-10-09T01:03:05Z)
- PACER: A Fully Push-forward-based Distributional Reinforcement Learning Algorithm [28.48626438603237]
PACER consists of a distributional critic, an actor and a sample-based encourager.
The push-forward operator is leveraged in both the critic and the actor to model return distributions and policies, respectively.
A sample-based utility value policy gradient is established for the push-forward policy update.
arXiv Detail & Related papers (2023-06-11T09:45:31Z)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)
- Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
arXiv Detail & Related papers (2022-05-20T09:38:04Z)
- Model-free Policy Learning with Reward Gradients [9.847875182113137]
We develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model.
Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
arXiv Detail & Related papers (2021-03-09T00:14:13Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations".
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
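Several of the related papers above (notably the active importance sampling and P3O entries) revolve around importance-sampling reweighting of policy gradients, and these likelihood ratios are exactly the quantities the featured paper implicates in heavy-tailedness. For reference, here is a minimal textbook sketch of an IS-weighted policy gradient surrogate, with a hypothetical `policy.log_prob` interface; it is not the specific scheme of any paper listed above.

```python
import torch

def is_weighted_pg_loss(policy, behavior_log_probs, obs, actions, advantages):
    # Reweight historical samples by the likelihood ratio pi_theta / pi_behavior.
    new_log_probs = policy.log_prob(obs, actions)                     # hypothetical API
    ratios = torch.exp(new_log_probs - behavior_log_probs).detach()   # IS weights
    # The gradient of this surrogate is the IS-weighted policy gradient estimate.
    return -(ratios * advantages * new_log_probs).mean()
```

When the current policy drifts far from the behavioral policy, these ratios become heavy-tailed, which is the regime the featured paper studies.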
This list is automatically generated from the titles and abstracts of the papers on this site.