Cautious Actor-Critic
- URL: http://arxiv.org/abs/2107.05217v1
- Date: Mon, 12 Jul 2021 06:40:02 GMT
- Title: Cautious Actor-Critic
- Authors: Lingwei Zhu, Toshinori Kitamura, Takamitsu Matsubara
- Abstract summary: We propose a novel off-policy AC algorithm, cautious actor-critic (CAC).
We show that CAC achieves comparable performance while significantly stabilizing learning.
- Score: 11.82492300303637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The oscillating performance of off-policy learning and persistent errors in
the actor-critic (AC) setting call for algorithms that learn conservatively and thus
better suit stability-critical applications. In this paper, we propose a novel
off-policy AC algorithm, cautious actor-critic (CAC). The name cautious comes from its
doubly conservative nature: we exploit the classic policy interpolation from
conservative policy iteration for the actor and the entropy regularization of
conservative value iteration for the critic. Our key observation is that the
entropy-regularized critic facilitates and simplifies the otherwise unwieldy
interpolated actor update while still ensuring robust policy improvement. We compare
CAC to state-of-the-art AC methods on a set of challenging continuous control problems
and demonstrate that CAC achieves comparable performance while significantly
stabilizing learning.
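To make the "doubly conservative" recipe concrete, below is a minimal illustrative sketch (not code from the paper) of the two ingredients in a tabular setting; the interpolation coefficient zeta and the entropy temperature tau are assumed hyperparameters.

```python
import numpy as np

# Illustrative sketch only: zeta (interpolation coefficient) and tau (entropy
# temperature) are assumed hyperparameters, not values from the paper.

def soft_state_values(q, tau=0.1):
    """Entropy-regularized (soft) values V(s) = tau * logsumexp(Q(s, .) / tau)."""
    z = q / tau
    z_max = z.max(axis=-1, keepdims=True)  # numerically stable log-sum-exp
    return tau * (z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1)))

def entropy_regularized_critic_target(q, rewards, next_states, gamma=0.99, tau=0.1):
    """Soft backup for the critic, in the spirit of conservative value iteration."""
    return rewards + gamma * soft_state_values(q[next_states], tau)

def interpolated_actor_update(pi_old, pi_greedy, zeta=0.1):
    """Conservative-policy-iteration style actor step: mix the (soft-)greedy
    policy with the current policy instead of replacing it outright."""
    return zeta * pi_greedy + (1.0 - zeta) * pi_old
```

In the actual algorithm both components are parameterized function approximators, and the abstract's point is that the entropy-regularized critic makes the interpolated actor update tractable; the sketch only conveys the structure of the two updates.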
Related papers
- Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic [29.711769434073755]
We introduce a novel concept of functional critic modeling, which leads to a new AC framework.
We provide a theoretical analysis in the linear function setting, establishing the provable convergence of our framework.
arXiv Detail & Related papers (2025-09-26T21:55:26Z)
- Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees [12.259191000019033]
Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
arXiv Detail & Related papers (2023-05-24T15:34:21Z)
- PAC-Bayesian Soft Actor-Critic Learning [9.752336113724928]
Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators.
We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm.
arXiv Detail & Related papers (2023-01-30T10:44:15Z)
- Solving Continuous Control via Q-learning [54.05120662838286]
We show that a simple modification of deep Q-learning largely alleviates issues with actor-critic methods.
By combining bang-bang action discretization with value decomposition, which frames single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods.
arXiv Detail & Related papers (2022-10-22T22:55:50Z)
- Actor-Critic based Improper Reinforcement Learning [61.430513757337486]
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process.
We propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic scheme and a Natural Actor-Critic scheme.
arXiv Detail & Related papers (2022-07-19T05:55:02Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Characterizing the Gap Between Actor-Critic and Policy Gradient [47.77939973964009]
We explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient.
We develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG.
arXiv Detail & Related papers (2021-06-13T06:35:42Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC algorithm (DR-Off-PAC) for discounted MDPs.
DR-Off-PAC adopts a single-timescale structure, in which both the actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
- Online Meta-Critic Learning for Off-Policy Actor-Critic Methods [107.98781730288897]
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
arXiv Detail & Related papers (2020-03-11T14:39:49Z)
- Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement [31.602912612167856]
In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) that conditions on inputs (states).
The speed of this concentration (of the actor on high-value actions) is controlled by a proposal policy that concentrates at a slower rate than the actor.
We empirically show that our Greedy AC algorithm, which uses this conditional CEM (CCEM) for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization (see the sketch below).
arXiv Detail & Related papers (2018-10-22T06:35:03Z)
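As a rough illustration of the conditional cross-entropy actor update described in the Greedy AC entry above (my own sketch, with assumed interfaces proposal.sample, critic(s, a), and actor.log_prob), the idea is to sample candidate actions per state from a proposal policy, keep the fraction ranked highest by the critic, and increase the actor's likelihood of those elite actions:

```python
import torch

def ccem_actor_loss(actor, critic, proposal, states, n_samples=30, elite_frac=0.2):
    """Conditional CEM-style actor loss (illustrative): fit the actor to the
    best critic-ranked actions sampled from a slower-moving proposal policy."""
    b, ds = states.shape
    s_rep = states.unsqueeze(1).expand(b, n_samples, ds)               # [B, N, ds]
    actions = proposal.sample(s_rep)                                    # [B, N, da] (assumed interface)
    q = critic(s_rep.reshape(b * n_samples, ds),
               actions.reshape(b * n_samples, -1)).view(b, n_samples)   # [B, N]
    k = max(1, int(elite_frac * n_samples))
    elite_idx = q.topk(k, dim=1).indices                                # [B, k] elite actions per state
    elites = torch.gather(actions, 1,
                          elite_idx.unsqueeze(-1).expand(b, k, actions.shape[-1]))
    # Raise the actor's log-likelihood of the elite actions (assumed interface).
    return -actor.log_prob(states.unsqueeze(1).expand(b, k, ds), elites).mean()
```

The blurb notes that the proposal policy concentrates more slowly than the actor; one simple way to realize that in this sketch would be to update the proposal toward the actor with a smaller step size.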
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.