ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive
Advantages
- URL: http://arxiv.org/abs/2306.01460v3
- Date: Fri, 24 Nov 2023 22:31:07 GMT
- Title: ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive
Advantages
- Authors: Andrew Jesson and Chris Lu and Gunshi Gupta and Angelos Filos and
Jakob Nicolaus Foerster and Yarin Gal
- Abstract summary: This paper introduces an effective and practical step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
We show that the additive term is bounded in proportion to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights.
We demonstrate significant improvements for median and interquartile mean metrics over PPO, SAC, and TD3 on the MuJoCo continuous control benchmark.
- Score: 41.30585319670119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces an effective and practical step toward approximate
Bayesian inference in on-policy actor-critic deep reinforcement learning. This
step manifests as three simple modifications to the Asynchronous Advantage
Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage
estimates, (2) spectral normalization of actor-critic weights, and (3)
incorporating dropout as a Bayesian approximation. We prove under standard
assumptions that restricting policy updates to positive advantages optimizes
for value by maximizing a lower bound on the value function plus an additive
term. We show that the additive term is bounded in proportion to the Lipschitz
constant of the value function, which offers theoretical grounding for spectral
normalization of critic weights. Finally, our application of dropout
corresponds to approximate Bayesian inference over both the actor and critic
parameters, which enables prudent state-aware exploration around the modes of
the actor via Thompson sampling. Extensive empirical evaluations on diverse
benchmarks reveal the superior performance of our approach compared to existing
on- and off-policy algorithms. We demonstrate significant improvements for
median and interquartile mean metrics over PPO, SAC, and TD3 on the MuJoCo
continuous control benchmark. Moreover, we see improvement over PPO in the
challenging ProcGen generalization benchmark.
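The three modifications are simple enough to sketch directly. Below is a minimal NumPy illustration of each mechanism in isolation; the function names and shapes are illustrative, not taken from the authors' implementation, which applies these inside an A3C-style network.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) ReLU on advantage estimates: only transitions with positive
# advantages contribute to the policy update.
def relu_advantages(advantages):
    return np.maximum(advantages, 0.0)

# (2) Spectral normalization: rescale a weight matrix by its largest
# singular value so the linear map is (approximately) 1-Lipschitz.
def spectral_normalize(w):
    sigma = np.linalg.norm(w, ord=2)  # largest singular value
    return w / sigma

# (3) Dropout as a Bayesian approximation: each forward pass samples a
# Bernoulli mask, so repeated stochastic passes behave like samples from
# an approximate posterior over network outputs (MC dropout).
def dropout(x, p, rng):
    mask = rng.random(x.shape) >= p
    return np.where(mask, x / (1.0 - p), 0.0)

adv = np.array([-1.0, 0.5, 2.0, -0.2])
w = rng.normal(size=(4, 4))
print(relu_advantages(adv))                           # negatives zeroed out
print(np.linalg.norm(spectral_normalize(w), ord=2))   # spectral norm is now 1
```

Thompson-sampling exploration then amounts to drawing one dropout mask per rollout and acting greedily under that sampled network.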
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
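The variance-reduction idea can be illustrated with a toy importance-sampling example. This is a generic sketch of why the choice of behavioral distribution matters, assuming a small discrete space and a known target distribution; it is not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p and a payoff f whose expectation E_p[f] we estimate.
xs = np.arange(4)
p = np.array([0.1, 0.2, 0.3, 0.4])
f = np.array([0.0, 0.0, 0.0, 10.0])  # payoff concentrated on one outcome

def is_estimate(q, n, rng):
    """Per-sample importance-sampling values of E_p[f] under behavior q."""
    idx = rng.choice(len(xs), size=n, p=q)
    w = p[idx] / q[idx]
    return w * f[idx]

uniform = np.full(4, 0.25)
optimal = p * np.abs(f)
optimal = optimal / optimal.sum()  # q* proportional to p|f| minimizes variance

a = is_estimate(uniform, 10000, rng)
b = is_estimate(optimal, 10000, rng)
print(a.mean(), a.var())   # unbiased, high variance
print(b.mean(), b.var())   # unbiased, zero variance in this toy case
```

Both behavior choices give an unbiased estimate, but sampling from q* collapses the estimator variance; actively choosing the behavioral policy to approximate this effect is the paper's theme.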
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- PPO-Clip Attains Global Optimality: Towards Deeper Understandings of
Clipping [16.772442831559538]
We establish the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings.
Our theoretical findings also mark the first characterization of the influence of the clipping mechanism on PPO-Clip convergence.
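For reference, the PPO-Clip surrogate whose convergence is analyzed can be stated in a few lines. A minimal NumPy sketch of the standard definition (not code from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

# With positive advantage, the objective stops growing once the probability
# ratio exceeds 1 + eps, so its gradient there is zero; this plateau is the
# clipping mechanism whose influence on convergence the paper characterizes.
ratio = np.array([0.5, 1.0, 1.5])
print(ppo_clip_objective(ratio, np.ones(3)))  # capped at 1 + eps = 1.2
```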
arXiv Detail & Related papers (2023-12-19T11:33:18Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives.
We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation.
Our framework does not require additional environment interactions, gradient computations, or ensembles.
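The population-based idea can be sketched generically: perturb the current value-function parameters, score every candidate on already-collected data, and keep the best, so no additional environment interactions or gradient computations are needed. A toy linear-value-function version (illustrative only; the paper operates on deep value networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed batch of (state-feature, target-value) pairs; candidates are
# scored on this batch, with no new environment interaction.
phi = rng.normal(size=(64, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
targets = phi @ true_w + 0.1 * rng.normal(size=64)

def value_error(w):
    """Mean squared error of a linear value function V(s) = phi(s) @ w."""
    return np.mean((phi @ w - targets) ** 2)

def value_function_search(w, pop_size=32, sigma=0.05, rng=rng):
    """Score a population of Gaussian perturbations of w; keep the best.
    Including w itself guarantees the search never does worse."""
    candidates = [w] + [w + sigma * rng.normal(size=w.shape)
                        for _ in range(pop_size)]
    return min(candidates, key=value_error)

w0 = true_w + 0.5 * rng.normal(size=5)   # imperfect current critic
w1 = value_function_search(w0)
print(value_error(w0), value_error(w1))  # error never increases
```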
arXiv Detail & Related papers (2023-02-20T18:23:47Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
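A tabular caricature of the one-step idea, assuming `q_beta` has already been fitted as an on-policy estimate of the behavior policy's action values (e.g., by SARSA on the offline data), and using exponentiated-Q reweighting as the regularized improvement step. The names and the specific regularizer are illustrative, since the paper evaluates several one-step variants:

```python
import numpy as np

# On-policy Q estimate of the behavior policy (2 states, 3 actions) and
# the behavior policy's action probabilities.
q_beta = np.array([[1.0, 2.0, 0.5],
                   [0.0, 1.5, 3.0]])
beta = np.array([[0.5, 0.4, 0.1],
                 [0.6, 0.3, 0.1]])

def one_step_improvement(q, behavior, tau=1.0):
    """One step of regularized improvement: reweight the behavior policy
    by exp(Q / tau) and renormalize, instead of iterating full policy
    evaluation and improvement."""
    logits = np.log(behavior + 1e-8) + q / tau
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

pi = one_step_improvement(q_beta, beta)
# The new policy shifts probability toward higher-Q actions while staying
# close to the behavior policy (it keeps the behavior policy's support).
print(pi)
```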
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
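The mean-variance criterion itself is one line; the penalty coefficient `lam` below is a free hyperparameter, and deriving on- and off-policy actor-critic updates for this criterion is the paper's contribution, not shown here.

```python
import numpy as np

def variance_penalized_objective(returns, lam=0.5):
    """Mean-variance criterion J = E[G] - lam * Var[G] over sampled returns."""
    g = np.asarray(returns, dtype=float)
    return g.mean() - lam * g.var()

# Two return distributions with the same mean but different variance:
steady = [1.0, 1.1, 0.9, 1.0]
risky = [3.0, -1.0, 2.0, 0.0]
print(variance_penalized_objective(steady))  # close to 1.0
print(variance_penalized_objective(risky))   # penalized down to -0.25
```

Maximizing this criterion prefers the steady trajectory distribution even though both have the same expected return.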
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal policy optimization algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.