ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
- URL: http://arxiv.org/abs/2306.01460v4
- Date: Thu, 10 Oct 2024 11:16:28 GMT
- Title: ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
- Authors: Andrew Jesson, Chris Lu, Gunshi Gupta, Nicolas Beltran-Velez, Angelos Filos, Jakob Nicolaus Foerster, Yarin Gal
- Abstract summary: This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm.
- Score: 37.12048108122337
- Abstract: This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
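The three modifications are straightforward to retrofit onto an existing on-policy actor-critic implementation. Below is a minimal PyTorch sketch written under assumptions not taken from the paper (network sizes, dropout rate, a diagonal Gaussian policy head, and Monte Carlo returns as the advantage baseline); it illustrates where each change sits rather than reproducing the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import spectral_norm


class ActorCritic(nn.Module):
    """A3C-style actor-critic with the paper's three changes wired in:
    (2) spectral-normalized weights and (3) dropout layers that stay
    active at action time so each forward pass is a posterior sample."""

    def __init__(self, obs_dim, act_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.actor = nn.Sequential(
            spectral_norm(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            nn.Dropout(p_drop),
            spectral_norm(nn.Linear(hidden, act_dim)),
        )
        self.critic = nn.Sequential(
            spectral_norm(nn.Linear(obs_dim, hidden)), nn.Tanh(),
            nn.Dropout(p_drop),
            spectral_norm(nn.Linear(hidden, 1)),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.actor(obs), self.log_std.exp())

    @torch.no_grad()
    def act(self, obs):
        # Keeping dropout on while acting means the sampled dropout mask
        # (one per step) drives exploration: Thompson sampling around the
        # modes of the actor.
        self.train()
        return self.dist(obs).sample()


def actor_critic_loss(model, obs, act, ret):
    value = model.critic(obs).squeeze(-1)
    adv = (ret - value).detach()
    # (1) ReLU on advantage estimates: only transitions with positive
    # advantage contribute to the policy update.
    pos_adv = F.relu(adv)
    logp = model.dist(obs).log_prob(act).sum(-1)
    policy_loss = -(pos_adv * logp).mean()
    value_loss = F.mse_loss(value, ret)
    return policy_loss + 0.5 * value_loss
```

Here the advantage is simply the return minus the critic's value estimate; the asynchronous workers, n-step return estimators, and exact architectural details of the paper are omitted.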
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping [16.772442831559538]
We establish the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings.
Our theoretical findings also mark the first characterization of the influence of the clipping mechanism on PPO-Clip convergence (a minimal sketch of the clipped surrogate objective that PPO-Clip builds on is given after this list).
arXiv Detail & Related papers (2023-12-19T11:33:18Z)
- Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing its effects on Deep PG primitives.
We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation.
Our framework does not require additional environment interactions, gradient computations, or ensembles.
arXiv Detail & Related papers (2023-02-20T18:23:47Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy [122.01837436087516]
We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
We establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time.
arXiv Detail & Related papers (2020-08-02T14:01:49Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
- Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm that improves policy performance by mitigating Q-value overestimation.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)
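For the PPO-Clip entry above, the clipping mechanism being analyzed operates on PPO's standard clipped surrogate objective. The sketch below shows that objective in a few lines of PyTorch (consistent with the earlier example); the specific PPO-Clip variant whose global convergence is established in that paper may differ in its update rule and function class.

```python
import torch


def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO clipped surrogate: the probability ratio between the
    new and old policies is clipped to [1 - eps, 1 + eps], removing the
    incentive to move the policy far from the data-collecting policy in
    a single update."""
    ratio = torch.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic element-wise minimum, averaged over the batch and
    # negated because optimizers minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()
```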