Phasic Policy Gradient
- URL: http://arxiv.org/abs/2009.04416v1
- Date: Wed, 9 Sep 2020 16:52:53 GMT
- Title: Phasic Policy Gradient
- Authors: Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman
- Abstract summary: In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function.
We introduce Phasic Policy Gradient, a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases.
PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features.
- Score: 24.966649684989367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework
which modifies traditional on-policy actor-critic methods by separating policy
and value function training into distinct phases. In prior methods, one must
choose between using a shared network or separate networks to represent the
policy and value function. Using separate networks avoids interference between
objectives, while using a shared network allows useful features to be shared.
PPG is able to achieve the best of both worlds by splitting optimization into
two phases, one that advances training and one that distills features. PPG also
enables the value function to be more aggressively optimized with a higher
level of sample reuse. Compared to PPO, we find that PPG significantly improves
sample efficiency on the challenging Procgen Benchmark.
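The phase structure described in the abstract lends itself to a compact sketch. The following is a minimal illustration in PyTorch, assuming a small discrete-action task; the architecture (a separate value network plus an auxiliary value head on the policy network), the loss weights, the phase lengths, and the dummy rollout tensors are illustrative placeholders rather than the paper's actual implementation or hyperparameters, and details such as GAE, minibatching, and data collection are omitted.

```python
# Minimal sketch of the PPG phase structure (PyTorch). Hypothetical sizes and
# hyperparameters; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence


class PolicyNet(nn.Module):
    """Policy network with an auxiliary value head on the shared trunk."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)
        self.aux_value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return Categorical(logits=self.pi_head(h)), self.aux_value_head(h).squeeze(-1)


class ValueNet(nn.Module):
    """Separate value network, trained in both phases."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


def policy_phase(policy, value, opt_pi, opt_v, batch, clip=0.2):
    """Policy phase: PPO-style clipped update for the policy, regression for the value net."""
    obs, actions, old_logp, adv, returns = batch
    dist, _ = policy(obs)
    ratio = torch.exp(dist.log_prob(actions) - old_logp)
    pi_loss = -torch.min(ratio * adv, torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

    v_loss = F.mse_loss(value(obs), returns)
    opt_v.zero_grad(); v_loss.backward(); opt_v.step()


def auxiliary_phase(policy, value, opt_pi, opt_v, obs, returns, old_logits, beta_clone=1.0):
    """Auxiliary phase: distill value targets into the policy trunk via the auxiliary
    head, while a KL (behavioral cloning) term keeps the policy close to its
    pre-phase behavior; the value network gets additional regression epochs."""
    dist, aux_v = policy(obs)
    joint_loss = F.mse_loss(aux_v, returns) + beta_clone * kl_divergence(
        Categorical(logits=old_logits), dist).mean()
    opt_pi.zero_grad(); joint_loss.backward(); opt_pi.step()

    v_loss = F.mse_loss(value(obs), returns)
    opt_v.zero_grad(); v_loss.backward(); opt_v.step()


if __name__ == "__main__":
    obs_dim, n_actions, n = 8, 4, 256
    policy, value = PolicyNet(obs_dim, n_actions), ValueNet(obs_dim)
    opt_pi = torch.optim.Adam(policy.parameters(), lr=5e-4)
    opt_v = torch.optim.Adam(value.parameters(), lr=5e-4)

    # Dummy rollout tensors standing in for collected environment data.
    obs, actions = torch.randn(n, obs_dim), torch.randint(0, n_actions, (n,))
    adv, returns = torch.randn(n), torch.randn(n)
    with torch.no_grad():
        old_dist, _ = policy(obs)
        old_logp, old_logits = old_dist.log_prob(actions), old_dist.logits

    for _ in range(4):   # policy-phase iterations (N_pi in the paper)
        policy_phase(policy, value, opt_pi, opt_v, (obs, actions, old_logp, adv, returns))
    for _ in range(6):   # auxiliary epochs (E_aux in the paper)
        auxiliary_phase(policy, value, opt_pi, opt_v, obs, returns, old_logits)
```

In the full method, several policy-phase iterations of rollout and optimization precede each auxiliary phase, and the auxiliary phase reuses all states collected during those iterations; this is what enables the more aggressive value-function optimization and higher sample reuse mentioned in the abstract.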
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Deep Reinforcement Learning for Traveling Purchaser Problems [63.37136587778153]
The traveling purchaser problem (TPP) is an important optimization problem with broad applications.
We propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately.
By introducing a meta-learning strategy, the policy network can be trained stably on large-sized TPP instances.
arXiv Detail & Related papers (2024-04-03T05:32:10Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives.
We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation.
Our framework does not require additional environment interactions, gradient computations, or ensembles.
arXiv Detail & Related papers (2023-02-20T18:23:47Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option, as it can fail to effectively bound the probability ratios (the standard clipped surrogate is written out after this list for reference), and propose Early Stopping Policy Optimization (ESPO) as an alternative.
We show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games [67.47961797770249]
Multi-Agent PPO (MAPPO) is a multi-agent PPO variant which adopts a centralized value function.
We show that MAPPO achieves performance comparable to the state-of-the-art in three popular multi-agent testbeds.
arXiv Detail & Related papers (2021-03-02T18:59:56Z)
- Proximal Policy Gradient: PPO with Policy Gradient [13.571988925615486]
We propose a new algorithm, PPG (Proximal Policy Gradient), which is close to both VPG (vanilla policy gradient) and PPO (proximal policy optimization).
The performance of PPG is comparable to that of PPO, and its policy entropy decays more slowly than PPO's.
arXiv Detail & Related papers (2020-10-20T00:14:57Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
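Several of the entries above (active importance sampling, ratio clipping, COPG, and the MAPPO and proximal-policy-gradient variants) revolve around the importance-weighted surrogate that PPO clips. As a point of reference only, here is the textbook form of the probability ratio and the PPO clipped objective; the individual papers modify or replace this objective in the ways their abstracts describe.

```latex
% Probability ratio between the current policy and the data-collecting policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% PPO's clipped surrogate: taking the min of the unclipped and clipped terms removes
% the incentive to push r_t(\theta) outside [1-\epsilon, 1+\epsilon]
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t \left[
    \min\bigl( r_t(\theta)\,\hat{A}_t,\;
               \mathrm{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr)
  \right]
```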
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.