DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
- URL: http://arxiv.org/abs/2305.18501v1
- Date: Mon, 29 May 2023 14:36:51 GMT
- Title: DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
- Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko
- Abstract summary: We introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations.
We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm.
- Score: 48.60180355291149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step learning applies lookahead over multiple time steps and has proved
valuable in policy evaluation settings. However, in the optimal control case,
the impact of multi-step learning has been relatively limited despite a number
of prior efforts. Fundamentally, this might be because multi-step policy
improvements require operations that cannot be approximated by stochastic
samples, hence hindering the widespread adoption of such methods in practice.
To address such limitations, we introduce doubly multi-step off-policy VI
(DoMo-VI), a novel oracle algorithm that combines multi-step policy
improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence
speed-up to the optimal policy and is applicable in general off-policy learning
settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC),
a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a
bias-variance trade-off that ensures improved policy gradient estimates. When
combined with the IMPALA architecture, DoMo-AC shows improvements over the
baseline algorithm on the Atari-57 game benchmark.
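The abstract does not spell out the estimator itself, so the following is only a minimal numpy sketch of the ingredients it refers to: a multi-step off-policy evaluation target built from clipped importance ratios (V-trace/Retrace-style traces, with the clip acting as the bias-variance knob) feeding an importance-weighted policy-gradient step. The function names and the exact trace form are illustrative assumptions, not the DoMo-VI/DoMo-AC operators defined in the paper.

```python
import numpy as np

def n_step_off_policy_target(rewards, values, rho, gamma=0.99, lam=1.0, rho_clip=1.0):
    """Multi-step off-policy evaluation target built from clipped importance
    ratios rho = pi/mu (V-trace/Retrace-style traces); illustrative only."""
    T = len(rewards)
    c = lam * np.minimum(rho, rho_clip)              # per-step trace coefficients
    targets = np.array(values[:-1], dtype=float)
    acc = 0.0
    for t in reversed(range(T)):
        delta = min(rho[t], rho_clip) * (rewards[t] + gamma * values[t + 1] - values[t])
        acc = delta + gamma * c[t] * acc             # accumulate corrected TD errors
        targets[t] = values[t] + acc
    return targets

def off_policy_pg_step(theta, states, actions, advantages, rho, lr=0.1, rho_clip=1.0):
    """One policy-gradient step on a tabular softmax policy, with each sample
    weighted by a clipped importance ratio (the clip is the bias-variance knob)."""
    grad = np.zeros_like(theta)
    for s, a, adv, r in zip(states, actions, advantages, rho):
        logits = theta[s] - theta[s].max()
        pi = np.exp(logits) / np.exp(logits).sum()
        g = -pi
        g[a] += 1.0                                  # d log pi(a|s) / d theta[s, :]
        grad[s] += min(r, rho_clip) * adv * g
    return theta + lr * grad / len(states)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n_states, n_actions = 8, 4, 3
    theta = np.zeros((n_states, n_actions))
    states = rng.integers(0, n_states, size=T)
    actions = rng.integers(0, n_actions, size=T)
    rewards = rng.normal(size=T)
    values = rng.normal(size=T + 1)                  # critic estimates V(s_0..s_T)
    rho = rng.uniform(0.5, 1.5, size=T)              # pi/mu importance ratios
    targets = n_step_off_policy_target(rewards, values, rho)
    theta = off_policy_pg_step(theta, states, actions, targets - values[:-1], rho)
```

In a sketch like this, lowering `rho_clip` (or `lam`) shortens the effective lookahead and trades variance for bias, which is the kind of trade-off the abstract describes.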
Related papers
- Zeroth-Order Actor-Critic [6.5158195776494]
We propose the Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies zeroth-order and first-order methods into an on-policy actor-critic architecture.
We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
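As background on the zeroth-order half of that combination, here is a generic antithetic evolution-strategies-style gradient estimator in numpy; ZOAC's actual estimator and its coupling with the critic are defined in the paper, so treat this purely as an illustration of what a zeroth-order policy update looks like.

```python
import numpy as np

def zeroth_order_gradient(objective, theta, sigma=0.1, n_pairs=16, rng=None):
    """Antithetic zeroth-order (evolution-strategies-style) gradient estimate:
    average of (J(theta + sigma*eps) - J(theta - sigma*eps)) / (2*sigma) * eps,
    which approximates grad J(theta) without differentiating the objective."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        j_plus = objective(theta + sigma * eps)
        j_minus = objective(theta - sigma * eps)
        grad += (j_plus - j_minus) / (2.0 * sigma) * eps
    return grad / n_pairs

# Example: maximize a simple quadratic "return" by zeroth-order ascent.
theta = np.zeros(4)
for _ in range(200):
    theta += 0.05 * zeroth_order_gradient(lambda p: -np.sum((p - 1.0) ** 2), theta)
```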
arXiv Detail & Related papers (2022-01-29T07:09:03Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
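A minimal tabular sketch of that recipe is below, under the assumption that the regularized improvement takes the common exponentiated-advantage (KL-to-behavior) form; the paper's exact constraint or regularizer may differ, and `fit_behavior_q` / `one_step_improvement` are illustrative names.

```python
import numpy as np

def fit_behavior_q(dataset, n_states, n_actions, gamma=0.99, lr=0.5, epochs=50):
    """On-policy (SARSA-style) estimate of Q^beta from logged transitions
    (s, a, r, s', a', done): no off-policy evaluation of the learned policy."""
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s2, a2, done in dataset:
            target = r + (0.0 if done else gamma * q[s2, a2])
            q[s, a] += lr * (target - q[s, a])
    return q

def one_step_improvement(q_beta, behavior_pi, temperature=1.0):
    """A single regularized improvement step. The exponentiated-advantage form
    is one common choice, assumed here for illustration only."""
    v_beta = (behavior_pi * q_beta).sum(axis=1, keepdims=True)
    weights = behavior_pi * np.exp((q_beta - v_beta) / temperature)
    return weights / weights.sum(axis=1, keepdims=True)

# Example with a 2-state, 2-action dataset of (s, a, r, s', a', done) tuples.
data = [(0, 0, 1.0, 1, 1, False), (1, 1, 0.0, 0, 0, True)]
q_b = fit_behavior_q(data, n_states=2, n_actions=2)
pi_b = np.full((2, 2), 0.5)                      # uniform behavior-policy estimate
pi_new = one_step_improvement(q_b, pi_b)
```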
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- Addressing Action Oscillations through Learning Policy Inertia [26.171039226334504]
The Policy Inertia Controller (PIC) serves as a generic plug-in framework for off-the-shelf DRL algorithms.
We propose Nested Policy Iteration as a general training algorithm for PIC-augmented policies.
We derive a practical DRL algorithm, namely Nested Soft Actor-Critic.
arXiv Detail & Related papers (2021-03-03T09:59:43Z)
- Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace the regularization with a constraint on the KL-divergence between consecutive policies.
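For a tabular policy the KL-penalized update has a closed form, sketched below: maximizing the expected Q-value minus eta times KL(pi || pi_prev) per state gives pi_new(a|s) proportional to pi_prev(a|s) * exp(Q(s,a)/eta). A helper for measuring the KL of consecutive policies is included because TRPO/MPO/VMPO enforce it as a hard constraint rather than a penalty; the `eta` parameterization here is illustrative, not any of those papers' exact scheme.

```python
import numpy as np

def kl_regularized_update(pi_prev, q, eta=1.0):
    """Closed-form maximizer of  <pi, q> - eta * KL(pi || pi_prev)  per state:
    pi_new(a|s) ~ pi_prev(a|s) * exp(q(s, a) / eta).
    Larger eta keeps the new policy closer to the previous one."""
    logits = np.log(pi_prev + 1e-12) + q / eta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Per-state KL(p || q), useful for checking a trust-region constraint
    instead of applying a penalty."""
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1)

# Example: 3 states, 2 actions, uniform previous policy.
pi_prev = np.full((3, 2), 0.5)
q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pi_new = kl_regularized_update(pi_prev, q, eta=0.5)
assert np.all(kl_divergence(pi_new, pi_prev) >= 0.0)
```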
arXiv Detail & Related papers (2021-02-11T19:35:33Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Off-Policy Multi-Agent Decomposed Policy Gradients [30.389041305278045]
We investigate causes that hinder the performance of multi-agent policy gradient (MAPG) algorithms and present a multi-agent decomposed policy gradient method (DOP).
DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment.
In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
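As one illustration of how a decomposed critic enables per-agent policy gradients, the sketch below assumes a linear factorization Q_tot(s, a) = sum_i k_i(s) * Q_i(s, a_i) + b(s), so each agent's gradient only needs its own utility and mixing weight; this is an assumption for illustration and not necessarily DOP's exact factorization or credit-assignment scheme.

```python
import numpy as np

def decomposed_pg_step(thetas, q_i, k, states, joint_actions, lr=0.1):
    """Per-agent policy-gradient step under a linearly decomposed critic:
    agent i is updated with weight k_i(s) and its own utility Q_i(s, a_i)."""
    for i in range(len(thetas)):
        grad = np.zeros_like(thetas[i])
        for s, a_joint in zip(states, joint_actions):
            a_i = a_joint[i]
            pi = np.exp(thetas[i][s] - thetas[i][s].max())
            pi /= pi.sum()
            g = -pi
            g[a_i] += 1.0                        # d log pi_i(a_i|s) / d logits
            grad[s] += k[s, i] * q_i[i][s, a_i] * g
        thetas[i] += lr * grad / len(states)
    return thetas

# Two agents, 3 states, 2 actions each; random utilities and mixing weights.
rng = np.random.default_rng(0)
thetas = [np.zeros((3, 2)), np.zeros((3, 2))]
q_i = [rng.normal(size=(3, 2)), rng.normal(size=(3, 2))]
k = rng.uniform(0.0, 1.0, size=(3, 2))           # non-negative mixing weights
thetas = decomposed_pg_step(thetas, q_i, k,
                            states=[0, 1, 2],
                            joint_actions=[(0, 1), (1, 0), (1, 1)])
```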
arXiv Detail & Related papers (2020-07-24T02:21:55Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.