DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
- URL: http://arxiv.org/abs/2305.18501v1
- Date: Mon, 29 May 2023 14:36:51 GMT
- Title: DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
- Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko
- Abstract summary: We introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations.
We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm.
- Score: 48.60180355291149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step learning applies lookahead over multiple time steps and has proved
valuable in policy evaluation settings. However, in the optimal control case,
the impact of multi-step learning has been relatively limited despite a number
of prior efforts. Fundamentally, this might be because multi-step policy
improvements require operations that cannot be approximated by stochastic
samples, hence hindering the widespread adoption of such methods in practice.
To address such limitations, we introduce doubly multi-step off-policy VI
(DoMo-VI), a novel oracle algorithm that combines multi-step policy
improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence
speed-up to the optimal policy and is applicable in general off-policy learning
settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC),
a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a
bias-variance trade-off that ensures improved policy gradient estimates. When
combined with the IMPALA architecture, DoMo-AC shows improvements over the
baseline algorithm on the Atari-57 game benchmark.
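The abstract does not spell out the estimator itself, so the following is only a minimal numpy sketch of the ingredients it refers to: a multi-step off-policy evaluation target built from clipped importance ratios (V-trace/Retrace-style traces, with the clip acting as the bias-variance knob) feeding an importance-weighted policy-gradient step. The function names and the exact trace form are illustrative assumptions, not the DoMo-VI/DoMo-AC operators defined in the paper.

```python
import numpy as np

def n_step_off_policy_target(rewards, values, rho, gamma=0.99, lam=1.0, rho_clip=1.0):
    """Multi-step off-policy evaluation target built from clipped importance
    ratios rho = pi/mu (V-trace/Retrace-style traces); illustrative only."""
    T = len(rewards)
    c = lam * np.minimum(rho, rho_clip)              # per-step trace coefficients
    targets = np.array(values[:-1], dtype=float)
    acc = 0.0
    for t in reversed(range(T)):
        delta = min(rho[t], rho_clip) * (rewards[t] + gamma * values[t + 1] - values[t])
        acc = delta + gamma * c[t] * acc             # accumulate corrected TD errors
        targets[t] = values[t] + acc
    return targets

def off_policy_pg_step(theta, states, actions, advantages, rho, lr=0.1, rho_clip=1.0):
    """One policy-gradient step on a tabular softmax policy, with each sample
    weighted by a clipped importance ratio (the clip is the bias-variance knob)."""
    grad = np.zeros_like(theta)
    for s, a, adv, r in zip(states, actions, advantages, rho):
        logits = theta[s] - theta[s].max()
        pi = np.exp(logits) / np.exp(logits).sum()
        g = -pi
        g[a] += 1.0                                  # d log pi(a|s) / d theta[s, :]
        grad[s] += min(r, rho_clip) * adv * g
    return theta + lr * grad / len(states)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n_states, n_actions = 8, 4, 3
    theta = np.zeros((n_states, n_actions))
    states = rng.integers(0, n_states, size=T)
    actions = rng.integers(0, n_actions, size=T)
    rewards = rng.normal(size=T)
    values = rng.normal(size=T + 1)                  # critic estimates V(s_0..s_T)
    rho = rng.uniform(0.5, 1.5, size=T)              # pi/mu importance ratios
    targets = n_step_off_policy_target(rewards, values, rho)
    theta = off_policy_pg_step(theta, states, actions, targets - values[:-1], rho)
```

In a sketch like this, lowering `rho_clip` (or `lam`) shortens the effective lookahead and trades variance for bias, which is the kind of trade-off the abstract describes.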
Related papers
- Zeroth-Order Actor-Critic [6.5158195776494]
We propose the Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies zeroth-order and first-order methods into an on-policy actor-critic architecture.
We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
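As background on the zeroth-order half of that combination, here is a generic antithetic evolution-strategies-style gradient estimator in numpy; ZOAC's actual estimator and its coupling with the critic are defined in the paper, so treat this purely as an illustration of what a zeroth-order policy update looks like.

```python
import numpy as np

def zeroth_order_gradient(objective, theta, sigma=0.1, n_pairs=16, rng=None):
    """Antithetic zeroth-order (evolution-strategies-style) gradient estimate:
    average of (J(theta + sigma*eps) - J(theta - sigma*eps)) / (2*sigma) * eps,
    which approximates grad J(theta) without differentiating the objective."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        j_plus = objective(theta + sigma * eps)
        j_minus = objective(theta - sigma * eps)
        grad += (j_plus - j_minus) / (2.0 * sigma) * eps
    return grad / n_pairs

# Example: maximize a simple quadratic "return" by zeroth-order ascent.
theta = np.zeros(4)
for _ in range(200):
    theta += 0.05 * zeroth_order_gradient(lambda p: -np.sum((p - 1.0) ** 2), theta)
```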
arXiv Detail & Related papers (2022-01-29T07:09:03Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
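A minimal tabular sketch of that recipe is below, under the assumption that the regularized improvement takes the common exponentiated-advantage (KL-to-behavior) form; the paper's exact constraint or regularizer may differ, and `fit_behavior_q` / `one_step_improvement` are illustrative names.

```python
import numpy as np

def fit_behavior_q(dataset, n_states, n_actions, gamma=0.99, lr=0.5, epochs=50):
    """On-policy (SARSA-style) estimate of Q^beta from logged transitions
    (s, a, r, s', a', done): no off-policy evaluation of the learned policy."""
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s2, a2, done in dataset:
            target = r + (0.0 if done else gamma * q[s2, a2])
            q[s, a] += lr * (target - q[s, a])
    return q

def one_step_improvement(q_beta, behavior_pi, temperature=1.0):
    """A single regularized improvement step. The exponentiated-advantage form
    is one common choice, assumed here for illustration only."""
    v_beta = (behavior_pi * q_beta).sum(axis=1, keepdims=True)
    weights = behavior_pi * np.exp((q_beta - v_beta) / temperature)
    return weights / weights.sum(axis=1, keepdims=True)

# Example with a 2-state, 2-action dataset of (s, a, r, s', a', done) tuples.
data = [(0, 0, 1.0, 1, 1, False), (1, 1, 0.0, 0, 0, True)]
q_b = fit_behavior_q(data, n_states=2, n_actions=2)
pi_b = np.full((2, 2), 0.5)                      # uniform behavior-policy estimate
pi_new = one_step_improvement(q_b, pi_b)
```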
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- Addressing Action Oscillations through Learning Policy Inertia [26.171039226334504]
The Policy Inertia Controller (PIC) serves as a generic plug-in framework for off-the-shelf DRL algorithms.
We propose Nested Policy Iteration as a general training algorithm for PIC-augmented policies.
We derive a practical DRL algorithm, namely Nested Soft Actor-Critic.
arXiv Detail & Related papers (2021-03-03T09:59:43Z)
- Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace the regularization with a constraint on the KL-divergence between consecutive policies.
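For a tabular policy the KL-penalized update has a closed form, sketched below: maximizing the expected Q-value minus eta times KL(pi || pi_prev) per state gives pi_new(a|s) proportional to pi_prev(a|s) * exp(Q(s,a)/eta). A helper for measuring the KL of consecutive policies is included because TRPO/MPO/VMPO enforce it as a hard constraint rather than a penalty; the `eta` parameterization here is illustrative, not any of those papers' exact scheme.

```python
import numpy as np

def kl_regularized_update(pi_prev, q, eta=1.0):
    """Closed-form maximizer of  <pi, q> - eta * KL(pi || pi_prev)  per state:
    pi_new(a|s) ~ pi_prev(a|s) * exp(q(s, a) / eta).
    Larger eta keeps the new policy closer to the previous one."""
    logits = np.log(pi_prev + 1e-12) + q / eta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Per-state KL(p || q), useful for checking a trust-region constraint
    instead of applying a penalty."""
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1)

# Example: 3 states, 2 actions, uniform previous policy.
pi_prev = np.full((3, 2), 0.5)
q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pi_new = kl_regularized_update(pi_prev, q, eta=0.5)
assert np.all(kl_divergence(pi_new, pi_prev) >= 0.0)
```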
arXiv Detail & Related papers (2021-02-11T19:35:33Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Off-Policy Multi-Agent Decomposed Policy Gradients [30.389041305278045]
We investigate causes that hinder the performance of multi-agent policy gradient (MAPG) algorithms and present a multi-agent decomposed policy gradient method (DOP).
DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment.
In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
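As one illustration of how a decomposed critic enables per-agent policy gradients, the sketch below assumes a linear factorization Q_tot(s, a) = sum_i k_i(s) * Q_i(s, a_i) + b(s), so each agent's gradient only needs its own utility and mixing weight; this is an assumption for illustration and not necessarily DOP's exact factorization or credit-assignment scheme.

```python
import numpy as np

def decomposed_pg_step(thetas, q_i, k, states, joint_actions, lr=0.1):
    """Per-agent policy-gradient step under a linearly decomposed critic:
    agent i is updated with weight k_i(s) and its own utility Q_i(s, a_i)."""
    for i in range(len(thetas)):
        grad = np.zeros_like(thetas[i])
        for s, a_joint in zip(states, joint_actions):
            a_i = a_joint[i]
            pi = np.exp(thetas[i][s] - thetas[i][s].max())
            pi /= pi.sum()
            g = -pi
            g[a_i] += 1.0                        # d log pi_i(a_i|s) / d logits
            grad[s] += k[s, i] * q_i[i][s, a_i] * g
        thetas[i] += lr * grad / len(states)
    return thetas

# Two agents, 3 states, 2 actions each; random utilities and mixing weights.
rng = np.random.default_rng(0)
thetas = [np.zeros((3, 2)), np.zeros((3, 2))]
q_i = [rng.normal(size=(3, 2)), rng.normal(size=(3, 2))]
k = rng.uniform(0.0, 1.0, size=(3, 2))           # non-negative mixing weights
thetas = decomposed_pg_step(thetas, q_i, k,
                            states=[0, 1, 2],
                            joint_actions=[(0, 1), (1, 0), (1, 1)])
```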
arXiv Detail & Related papers (2020-07-24T02:21:55Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.