OPAC: Opportunistic Actor-Critic
- URL: http://arxiv.org/abs/2012.06555v1
- Date: Fri, 11 Dec 2020 18:33:35 GMT
- Title: OPAC: Opportunistic Actor-Critic
- Authors: Srinjoy Roy, Saptam Bakshi, Tamal Maharaj
- Abstract summary: Opportunistic Actor-Critic (OPAC) is a novel model-free deep RL algorithm that employs a better exploration policy and achieves lower variance.
OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a policy in an off-policy way.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Actor-critic methods, a type of model-free reinforcement learning (RL), have
achieved state-of-the-art performance in many real-world continuous-control
domains. Despite their success, these models are still far from wide-scale
deployment. The main problems in these actor-critic methods are inefficient
exploration and sub-optimal policies. Soft Actor-Critic (SAC) and Twin Delayed
Deep Deterministic Policy Gradient (TD3), two such cutting-edge algorithms,
suffer from these issues. SAC effectively addressed sample complexity and the
brittleness of convergence with respect to hyper-parameters, and thus
outperformed all state-of-the-art algorithms, including TD3, in harder tasks,
whereas TD3 produced moderate results in all environments. SAC suffers from
inefficient exploration owing to the Gaussian nature of its policy, which
causes borderline performance in simpler tasks. In this paper, we introduce
Opportunistic Actor-Critic (OPAC), a novel model-free deep RL algorithm with a
better exploration policy and lower variance. OPAC combines some of the most
powerful features of TD3 and SAC and aims to optimize a stochastic policy in an
off-policy way. For calculating the target Q-values, OPAC uses three critics
instead of two and, based on the complexity of the environment,
opportunistically chooses how the target Q-value is computed from the critics'
evaluations. We have systematically evaluated the algorithm on MuJoCo
environments, where it achieves state-of-the-art performance and outperforms or
at least matches TD3 and SAC.
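The abstract does not specify the exact rule by which the three critics are combined, so the snippet below is only a minimal sketch of how an opportunistic target-Q computation of this kind could look. The concrete choices (averaging the two smallest critic estimates in "hard" environments, plain averaging otherwise, a SAC-style entropy term, and the hard_env flag standing in for the paper's notion of environment complexity) are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch: combining three critic estimates into a TD target.
# The selection rule and all hyper-parameters below are assumptions made
# for illustration; the OPAC abstract only states that the combination is
# chosen "opportunistically" based on environment complexity.
import numpy as np

def opac_style_target_q(q1, q2, q3, reward, done, log_prob,
                        alpha=0.2, gamma=0.99, hard_env=True):
    """Build a soft TD target from three critic estimates of Q(s', a')."""
    stacked = np.stack([q1, q2, q3], axis=0)          # shape: (3, batch)
    if hard_env:
        # Pessimistic combination: mean of the two smallest estimates,
        # a softened version of TD3's clipped double-Q trick.
        q_next = np.sort(stacked, axis=0)[:2].mean(axis=0)
    else:
        # Optimistic combination for simpler tasks: mean of all three.
        q_next = stacked.mean(axis=0)
    soft_value = q_next - alpha * log_prob            # SAC-style entropy term
    return reward + gamma * (1.0 - done) * soft_value

# Toy usage on a batch of two transitions.
q1 = np.array([1.0, 2.0]); q2 = np.array([0.8, 2.5]); q3 = np.array([1.2, 1.9])
r = np.array([0.1, 0.0]); d = np.array([0.0, 1.0]); logp = np.array([-1.0, -0.5])
print(opac_style_target_q(q1, q2, q3, r, d, logp))
```

Averaging two of the three critics, rather than taking a hard minimum, is one plausible way to curb overestimation bias without the underestimation a strict minimum can introduce; whether OPAC uses this particular rule is not stated in the abstract.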
Related papers
- DSAC-T: Distributional Soft Actor-Critic with Three Refinements [31.590177154247485]
We introduce an off-policy RL algorithm called distributional soft actor-critic (DSAC).
Standard DSAC has its own shortcomings, including occasionally unstable learning processes and the necessity for task-specific reward scaling.
This paper introduces three important refinements to standard DSAC in order to address these shortcomings.
arXiv Detail & Related papers (2023-10-09T16:52:48Z)
- Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL [47.80205106726076]
We present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation.
SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function (a minimal sketch of this per-dimension discretization appears after this list).
Our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including MuJoCo's Humanoid and Box2D's BiWalker.
arXiv Detail & Related papers (2023-08-20T08:32:11Z)
- PAC-Bayesian Soft Actor-Critic Learning [9.752336113724928]
Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement via two separate function approximators.
We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm.
arXiv Detail & Related papers (2023-01-30T10:44:15Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of high average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
- Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability [67.8426046908398]
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world.
This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions.
arXiv Detail & Related papers (2022-04-08T20:46:16Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- OffCon$^3$: What is state of the art anyway? [20.59974596074688]
Two popular approaches to model-free continuous control tasks are SAC and TD3.
TD3 is derived from DPG, which uses a deterministic policy to perform policy ascent along the value function.
OffCon$^3$ is a code base featuring state-of-the-art versions of both algorithms.
arXiv Detail & Related papers (2021-01-27T11:45:08Z)
- Band-limited Soft Actor Critic Model [15.11069042369131]
Soft Actor Critic (SAC) algorithms show remarkable performance in complex simulated environments.
We take this idea one step further by artificially bandlimiting the target critic spatial resolution.
We derive the closed form solution in the linear case and show that bandlimiting reduces the interdependency between the low frequency components of the state-action value approximation.
arXiv Detail & Related papers (2020-06-19T22:52:43Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
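As noted in the Soft Decomposed Policy-Critic entry above, the sketch below illustrates only the per-dimension action discretization idea: each continuous action dimension gets its own small set of bins and its own independent (here, softmax) head, and the chosen bins are mapped back to a continuous action vector. The bin count, evenly spaced bin centers, and random logits are illustrative assumptions; the shared soft-Q critic that SDPC trains on top of this factorization is not shown.

```python
# Hedged sketch of per-dimension action discretization (SDPC-style).
# Details such as the number of bins and the softmax policy heads are
# assumptions for illustration, not taken from the paper.
import numpy as np

def decode_discrete_action(bin_indices, low, high, n_bins):
    """Map one bin index per action dimension back to a continuous action."""
    # Evenly spaced bin centers over [low, high]: shape (n_bins, action_dim).
    centers = np.linspace(low, high, n_bins)
    return centers[bin_indices, np.arange(len(bin_indices))]

def sample_factored_policy(logits, rng):
    """Sample one bin per dimension from independent softmax heads."""
    # logits: (action_dim, n_bins); each row is its own categorical distribution.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Toy usage: a 3-dimensional action in [-1, 1] with 5 bins per dimension.
rng = np.random.default_rng(0)
action_dim, n_bins = 3, 5
low, high = np.full(action_dim, -1.0), np.full(action_dim, 1.0)
logits = rng.normal(size=(action_dim, n_bins))   # stand-in for a learned policy head
bins = sample_factored_policy(logits, rng)
print(decode_discrete_action(bins, low, high, n_bins))
```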