OPAC: Opportunistic Actor-Critic
- URL: http://arxiv.org/abs/2012.06555v1
- Date: Fri, 11 Dec 2020 18:33:35 GMT
- Title: OPAC: Opportunistic Actor-Critic
- Authors: Srinjoy Roy, Saptam Bakshi, Tamal Maharaj
- Abstract summary: Opportunistic Actor-Critic (OPAC) is a novel model-free deep RL algorithm that employs a better exploration policy and achieves lower variance.
OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a policy in an off-policy way.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Actor-critic methods, a type of model-free reinforcement learning (RL), have
achieved state-of-the-art performance in many real-world continuous-control
domains. Despite their success, these models are still far from wide-scale
deployment. The main problems in these actor-critic methods are inefficient
exploration and sub-optimal policies. Soft Actor-Critic (SAC) and Twin Delayed
Deep Deterministic Policy Gradient (TD3), two such cutting-edge algorithms,
suffer from these issues. SAC effectively addressed sample complexity and the
brittleness of convergence with respect to hyper-parameters, and thus
outperformed all state-of-the-art algorithms, including TD3, in harder tasks,
whereas TD3 produced moderate results in all environments. SAC suffers from
inefficient exploration owing to the Gaussian nature of its policy, which
causes borderline performance in simpler tasks. In this paper, we introduce
Opportunistic Actor-Critic (OPAC), a novel model-free deep RL algorithm with a
better exploration policy and lower variance. OPAC combines some of the most
powerful features of TD3 and SAC and aims to optimize a stochastic policy in an
off-policy way. For calculating the target Q-values, OPAC uses three critics
instead of two and, based on the complexity of the environment,
opportunistically chooses how the target Q-value is computed from the critics'
evaluations. We have systematically evaluated the algorithm on MuJoCo
environments, where it achieves state-of-the-art performance and outperforms or
at least matches TD3 and SAC.
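The abstract does not specify the exact rule by which the three critics are combined, so the snippet below is only a minimal sketch of how an opportunistic target-Q computation of this kind could look. The concrete choices (averaging the two smallest critic estimates in "hard" environments, plain averaging otherwise, a SAC-style entropy term, and the hard_env flag standing in for the paper's notion of environment complexity) are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch: combining three critic estimates into a TD target.
# The selection rule and all hyper-parameters below are assumptions made
# for illustration; the OPAC abstract only states that the combination is
# chosen "opportunistically" based on environment complexity.
import numpy as np

def opac_style_target_q(q1, q2, q3, reward, done, log_prob,
                        alpha=0.2, gamma=0.99, hard_env=True):
    """Build a soft TD target from three critic estimates of Q(s', a')."""
    stacked = np.stack([q1, q2, q3], axis=0)          # shape: (3, batch)
    if hard_env:
        # Pessimistic combination: mean of the two smallest estimates,
        # a softened version of TD3's clipped double-Q trick.
        q_next = np.sort(stacked, axis=0)[:2].mean(axis=0)
    else:
        # Optimistic combination for simpler tasks: mean of all three.
        q_next = stacked.mean(axis=0)
    soft_value = q_next - alpha * log_prob            # SAC-style entropy term
    return reward + gamma * (1.0 - done) * soft_value

# Toy usage on a batch of two transitions.
q1 = np.array([1.0, 2.0]); q2 = np.array([0.8, 2.5]); q3 = np.array([1.2, 1.9])
r = np.array([0.1, 0.0]); d = np.array([0.0, 1.0]); logp = np.array([-1.0, -0.5])
print(opac_style_target_q(q1, q2, q3, r, d, logp))
```

Averaging two of the three critics, rather than taking a hard minimum, is one plausible way to curb overestimation bias without the underestimation a strict minimum can introduce; whether OPAC uses this particular rule is not stated in the abstract.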
Related papers
- DSAC-T: Distributional Soft Actor-Critic with Three Refinements [31.590177154247485]
We introduce an off-policy RL algorithm called distributional soft actor-critic (DSAC).
Standard DSAC has its own shortcomings, including occasionally unstable learning processes and the necessity for task-specific reward scaling.
This paper introduces three important refinements to standard DSAC in order to address these shortcomings.
arXiv Detail & Related papers (2023-10-09T16:52:48Z)
- Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL [47.80205106726076]
We present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation.
SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function (a minimal sketch of this per-dimension discretization appears after this list).
Our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including MuJoCo's Humanoid and Box2D's BiWalker.
arXiv Detail & Related papers (2023-08-20T08:32:11Z)
- PAC-Bayesian Soft Actor-Critic Learning [9.752336113724928]
Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement via two separate function approximators.
We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm.
arXiv Detail & Related papers (2023-01-30T10:44:15Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world.
This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- OffCon$^3$: What is state of the art anyway? [20.59974596074688]
Two popular approaches to model-free continuous control tasks are SAC and TD3.
TD3 is derived from DPG, which uses a deterministic policy to perform policy ascent along the value function.
OffCon$^3$ is a code base featuring state-of-the-art versions of both algorithms.
arXiv Detail & Related papers (2021-01-27T11:45:08Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
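As noted in the Soft Decomposed Policy-Critic entry above, the sketch below illustrates only the per-dimension action discretization idea: each continuous action dimension gets its own small set of bins and its own independent (here, softmax) head, and the chosen bins are mapped back to a continuous action vector. The bin count, evenly spaced bin centers, and random logits are illustrative assumptions; the shared soft-Q critic that SDPC trains on top of this factorization is not shown.

```python
# Hedged sketch of per-dimension action discretization (SDPC-style).
# Details such as the number of bins and the softmax policy heads are
# assumptions for illustration, not taken from the paper.
import numpy as np

def decode_discrete_action(bin_indices, low, high, n_bins):
    """Map one bin index per action dimension back to a continuous action."""
    # Evenly spaced bin centers over [low, high]: shape (n_bins, action_dim).
    centers = np.linspace(low, high, n_bins)
    return centers[bin_indices, np.arange(len(bin_indices))]

def sample_factored_policy(logits, rng):
    """Sample one bin per dimension from independent softmax heads."""
    # logits: (action_dim, n_bins); each row is its own categorical distribution.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Toy usage: a 3-dimensional action in [-1, 1] with 5 bins per dimension.
rng = np.random.default_rng(0)
action_dim, n_bins = 3, 5
low, high = np.full(action_dim, -1.0), np.full(action_dim, 1.0)
logits = rng.normal(size=(action_dim, n_bins))   # stand-in for a learned policy head
bins = sample_factored_policy(logits, rng)
print(decode_discrete_action(bins, low, high, n_bins))
```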