Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
- URL: http://arxiv.org/abs/2003.05334v2
- Date: Mon, 2 Nov 2020 04:53:38 GMT
- Title: Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
- Authors: Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, Timothy M. Hospedales
- Abstract summary: Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks.
We introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor.
- Score: 107.98781730288897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety
of continuous control tasks. Normally, the critic's action-value function is
updated using temporal-difference, and the critic in turn provides a loss for
the actor that trains it to take actions with higher expected return. In this
paper, we introduce a novel and flexible meta-critic that observes the learning
process and meta-learns an additional loss for the actor that accelerates and
improves actor-critic learning. Compared to the vanilla critic, the meta-critic
network is explicitly trained to accelerate the learning process; and compared
to existing meta-learning algorithms, meta-critic is rapidly learned online for
a single task, rather than slowly over a family of tasks. Crucially, our
meta-critic framework is designed for off-policy based learners, which
currently provide state-of-the-art reinforcement learning sample efficiency. We
demonstrate that online meta-critic learning leads to improvements in a variety
of continuous control environments when combined with contemporary Off-PAC
methods DDPG, TD3 and the state-of-the-art SAC.
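To make the idea concrete, below is a minimal PyTorch sketch of the actor update the abstract describes: the actor minimizes the vanilla critic loss plus an auxiliary loss produced by a meta-critic network. All dimensions, architectures, and hyperparameters are illustrative placeholders rather than the paper's implementation, and the meta-critic's own meta-update is only indicated in a comment.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and tiny networks (placeholders, not the paper's).
STATE_DIM, ACTION_DIM = 8, 2
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
# The meta-critic maps (state, action) pairs to an auxiliary scalar loss.
meta_critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                            nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_loss(states):
    actions = actor(states)
    sa = torch.cat([states, actions], dim=-1)
    main = -critic(sa).mean()        # vanilla Off-PAC actor loss
    aux = meta_critic(sa).mean()     # meta-learned auxiliary loss
    return main + aux

states = torch.randn(32, STATE_DIM)  # stand-in for a replay-buffer batch
loss = actor_loss(states)
actor_opt.zero_grad()
loss.backward()
actor_opt.step()

# Meta-update (omitted): take a virtual one-step actor update including the
# auxiliary loss, then adjust the meta-critic so the updated actor improves
# under the vanilla loss on a held-out batch, differentiating through the
# update (e.g., with torch.func or the `higher` library).
```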
Related papers
- Efficient Offline Reinforcement Learning: The Critic is Critical [5.916429671763282]
Off-policy reinforcement learning offers a promising route to improving performance beyond what supervised approaches achieve.
We propose a best-of-both approach by first learning the behavior policy and critic with supervised learning, before improving with off-policy reinforcement learning.
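As a rough illustration of this best-of-both recipe (with made-up networks and offline data stand-ins, not the paper's setup), the sketch below pretrains a policy and critic by supervised learning, then switches to an off-policy actor update against the pretrained critic:

```python
import torch
import torch.nn as nn

# Hypothetical networks and offline data stand-ins; not the paper's setup.
STATE_DIM, ACTION_DIM = 8, 2
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, ACTION_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))

states = torch.randn(256, STATE_DIM)
actions = torch.randn(256, ACTION_DIM)
returns = torch.randn(256, 1)  # stand-in return targets for the critic

# Phase 1: supervised learning -- behavior cloning for the policy and
# regression onto observed returns for the critic.
pretrain_opt = torch.optim.Adam(
    list(policy.parameters()) + list(critic.parameters()), lr=1e-3)
for _ in range(100):
    bc_loss = ((policy(states) - actions) ** 2).mean()
    q_loss = ((critic(torch.cat([states, actions], -1)) - returns) ** 2).mean()
    pretrain_opt.zero_grad()
    (bc_loss + q_loss).backward()
    pretrain_opt.step()

# Phase 2: off-policy improvement -- push the policy toward actions the
# pretrained critic scores highly.
actor_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
for _ in range(100):
    a = policy(states)
    actor_loss = -critic(torch.cat([states, a], -1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```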
arXiv Detail & Related papers (2024-06-19T09:16:38Z)
- PAC-Bayesian Soft Actor-Critic Learning [9.752336113724928]
Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators.
We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm.
arXiv Detail & Related papers (2023-01-30T10:44:15Z)
- Solving Continuous Control via Q-learning [54.05120662838286]
We show that a simple modification of deep Q-learning largely alleviates issues with actor-critic methods.
By combining bang-bang action discretization with value decomposition, which frames single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods.
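A minimal sketch of the decoupled, critic-only idea (shapes and the network are hypothetical): each action dimension gets its own Q-values over the two bang-bang actions {-1, +1}, and the joint value decomposes as a sum over dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: each action dimension has Q-values for the two
# bang-bang actions {-1, +1}; the joint value is the sum over dimensions
# (value decomposition, as in cooperative MARL).
STATE_DIM, ACTION_DIM = 8, 2
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM * 2))  # 2 bins per dimension

def greedy_action(state):
    q = q_net(state).view(-1, ACTION_DIM, 2)  # (batch, dim, bin)
    bins = q.argmax(dim=-1)                   # best bin per dimension
    return bins.float() * 2.0 - 1.0           # map {0, 1} -> {-1, +1}

def joint_value(state):
    q = q_net(state).view(-1, ACTION_DIM, 2)
    return q.max(dim=-1).values.sum(dim=-1)   # sum of per-dimension maxima

state = torch.randn(4, STATE_DIM)
print(greedy_action(state))  # bang-bang actions in {-1, +1} per dimension
print(joint_value(state))    # decomposed greedy value
```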
arXiv Detail & Related papers (2022-10-22T22:55:50Z)
- Meta-Learning with Self-Improving Momentum Target [72.98879709228981]
We propose Self-improving Momentum Target (SiMT) to improve the performance of a meta-learner.
SiMT generates the target model by adapting from the temporal ensemble of the meta-learner.
We show that SiMT brings a significant performance gain when combined with a wide range of meta-learning methods.
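The temporal-ensemble mechanism is essentially an exponential moving average (EMA) of the learner's weights used as a distillation target. A toy sketch, with an arbitrary network, decay rate, and stand-in task loss, none of which are SiMT's actual choices:

```python
import copy
import torch
import torch.nn as nn

# Placeholder network, decay rate, and task loss; not SiMT's actual choices.
model = nn.Linear(8, 4)
target = copy.deepcopy(model)  # temporal ensemble of the learner's weights
DECAY = 0.995

@torch.no_grad()
def update_target():
    for p, tp in zip(model.parameters(), target.parameters()):
        tp.mul_(DECAY).add_(p, alpha=1.0 - DECAY)  # EMA update

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(16, 8), torch.randn(16, 4)
for _ in range(10):
    with torch.no_grad():
        t = target(x)                            # momentum-target predictions
    task_loss = ((model(x) - y) ** 2).mean()     # stand-in task loss
    distill_loss = ((model(x) - t) ** 2).mean()  # self-distillation term
    loss = task_loss + distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    update_target()
```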
arXiv Detail & Related papers (2022-10-11T06:45:15Z)
- On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning [71.55412580325743]
We show that multi-task pretraining with fine-tuning on new tasks performs as well as, or better than, meta-pretraining with meta test-time adaptation.
This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL.
arXiv Detail & Related papers (2022-06-07T13:24:00Z)
- TASAC: a twin-actor reinforcement learning framework with stochastic policy for batch process control [1.101002667958165]
Reinforcement Learning (RL), wherein an agent learns the policy by directly interacting with the environment, offers a potential alternative in this context.
RL frameworks with actor-critic architecture have recently become popular for controlling systems where state and action spaces are continuous.
It has been shown that an ensemble of actor and critic networks further helps the agent learn better policies, owing to the enhanced exploration that comes from learning multiple policies simultaneously.
arXiv Detail & Related papers (2022-04-22T13:00:51Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
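A hedged sketch of the first ingredient, with a hypothetical critic ensemble and exploration policy: the exploration policy is trained to maximize an approximate upper confidence bound, here the ensemble mean plus BETA times its standard deviation (the DICE correction is not shown):

```python
import torch
import torch.nn as nn

# Hypothetical critic ensemble and exploration policy; BETA is illustrative.
STATE_DIM, ACTION_DIM, BETA = 8, 2, 1.0
critics = nn.ModuleList([
    nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 1))
    for _ in range(4)
])
explorer = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, ACTION_DIM), nn.Tanh())
opt = torch.optim.Adam(explorer.parameters(), lr=3e-4)

s = torch.randn(32, STATE_DIM)
a = explorer(s)
sa = torch.cat([s, a], dim=-1)
qs = torch.stack([c(sa) for c in critics])   # (ensemble, batch, 1)
ucb = qs.mean(dim=0) + BETA * qs.std(dim=0)  # approximate upper bound
loss = -ucb.mean()                           # explorer maximizes the UCB
opt.zero_grad()
loss.backward()
opt.step()
```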
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- GRAC: Self-Guided and Self-Regularized Actor-Critic [24.268453994605512]
We propose a self-regularized TD-learning method to address divergence without requiring a target network.
We also propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization.
This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network.
We evaluate GRAC on the suite of OpenAI Gym tasks, matching or outperforming the state of the art in every environment tested.
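A simplified sketch of target-network-free, self-regularized TD learning in the spirit described above (this is not GRAC's exact objective; the critic and regularization weight are placeholders): the bootstrap target comes from the current critic, and a penalty keeps next-state predictions close to their pre-update values:

```python
import torch
import torch.nn as nn

# Placeholder critic and regularization weight; not GRAC's exact objective.
STATE_DIM, ACTION_DIM, GAMMA, REG = 8, 2, 0.99, 1.0
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def td_step(s, a, r, s_next, a_next):
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, a_next], -1))  # pre-update value
        y = r + GAMMA * q_next              # bootstrap target, no target net
    td_loss = ((critic(torch.cat([s, a], -1)) - y) ** 2).mean()
    # Self-regularizer: keep next-state predictions close to their values
    # before this update, in place of a separate target network.
    reg = ((critic(torch.cat([s_next, a_next], -1)) - q_next) ** 2).mean()
    loss = td_loss + REG * reg
    opt.zero_grad()
    loss.backward()
    opt.step()

batch = [torch.randn(32, d)
         for d in (STATE_DIM, ACTION_DIM, 1, STATE_DIM, ACTION_DIM)]
td_step(*batch)
```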
arXiv Detail & Related papers (2020-09-18T17:58:29Z)
- Meta-Gradient Reinforcement Learning with an Objective Discovered Online [54.15180335046361]
We propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network.
Because the objective is discovered online, it can adapt to changes over time.
On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency.
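A toy, self-contained sketch of online meta-gradient descent (the real method parameterizes the objective with a deep neural network; here, purely for illustration, the "discovered" objective reduces to a learned scaling of a squared-error loss):

```python
import torch

# The agent is a single linear predictor and the "discovered" objective is
# just a learned scaling of its squared error -- purely illustrative.
torch.manual_seed(0)
w = torch.zeros(8, requires_grad=True)    # agent parameters
eta = torch.zeros(1, requires_grad=True)  # objective parameters
meta_opt = torch.optim.Adam([eta], lr=1e-2)
LR = 0.1

x, y = torch.randn(64, 8), torch.randn(64)
for _ in range(50):
    # Inner step: update the agent with the parameterised objective.
    err = (x @ w - y) ** 2
    inner_loss = (torch.sigmoid(eta) * err).mean()
    grad_w = torch.autograd.grad(inner_loss, w, create_graph=True)[0]
    w_new = w - LR * grad_w
    # Outer step: evaluate the updated agent on the true objective and
    # backpropagate through the inner update into the objective parameters.
    outer_loss = ((x @ w_new - y) ** 2).mean()
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
    w = w_new.detach().requires_grad_(True)  # commit the inner step online
```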
arXiv Detail & Related papers (2020-07-16T16:17:09Z)
- How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization [10.424426548124696]
We propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients.
MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning.
We demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
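A rough sketch of the core computation (networks, shapes, and the exact loss are placeholders, not MAGE's implementation): the TD error is formed through a learned dynamics model so it can be differentiated with respect to the action, and the critic is trained on that action-gradient:

```python
import torch
import torch.nn as nn

# Placeholder networks and shapes; not MAGE's implementation.
STATE_DIM, ACTION_DIM, GAMMA = 8, 2, 0.99
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
# Learned model predicting next state and reward (placeholder architecture).
dynamics = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                         nn.Linear(64, STATE_DIM + 1))

s = torch.randn(32, STATE_DIM)
a = actor(s).detach().requires_grad_(True)  # differentiate w.r.t. the action

pred = dynamics(torch.cat([s, a], dim=-1))
s_next, r = pred[:, :STATE_DIM], pred[:, STATE_DIM:]
td_error = (r + GAMMA * critic(torch.cat([s_next, actor(s_next)], -1))
            - critic(torch.cat([s, a], -1)))

# Action-gradient of the TD error, computed by backpropagating through the
# learned dynamics; the critic loss penalises its magnitude.
grad_a = torch.autograd.grad(td_error.sum(), a, create_graph=True)[0]
critic_loss = grad_a.norm(dim=-1).mean()
critic_loss.backward()  # gradients now flow into the critic's parameters
```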
arXiv Detail & Related papers (2020-04-29T16:30:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.