Settling the Variance of Multi-Agent Policy Gradients
- URL: http://arxiv.org/abs/2108.08612v2
- Date: Fri, 20 Aug 2021 10:03:19 GMT
- Title: Settling the Variance of Multi-Agent Policy Gradients
- Authors: Jakub Grudzien Kuba, Muning Wen, Yaodong Yang, Linghui Meng, Shangding
Gu, Haifeng Zhang, David Henry Mguni, Jun Wang
- Abstract summary: Policy gradient (PG) methods are popular reinforcement learning (RL) methods.
In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents.
We offer a rigorous analysis of MAPG methods by quantifying the contributions of the number of agents and agents' explorations to the variance of MAPG estimators.
Based on this analysis, we derive the optimal baseline (OB) that achieves minimal variance, and we propose a surrogate version of OB that can be seamlessly plugged into any existing PG method in MARL.
- Score: 14.558011059649543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient (PG) methods are popular reinforcement learning (RL) methods
where a baseline is often applied to reduce the variance of gradient estimates.
In multi-agent RL (MARL), although the PG theorem can be naturally extended,
the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of
gradient estimates increases rapidly with the number of agents. In this paper,
we offer a rigorous analysis of MAPG methods by, firstly, quantifying the
contributions of the number of agents and agents' explorations to the variance
of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB)
that achieves the minimal variance. In comparison to the OB, we measure the
excess variance of existing MARL algorithms such as vanilla MAPG and COMA.
To accommodate deep neural networks, we also propose a surrogate version of
OB, which can be seamlessly plugged into any existing PG method in MARL. On
benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique
effectively stabilises training and improves the performance of multi-agent PPO
and COMA algorithms by a significant margin.
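To make the role of a variance-reducing baseline concrete, here is a minimal single-agent, single-state sketch (not the paper's multi-agent derivation; all numbers are made up) comparing the empirical variance of a score-function gradient estimator under the usual value-function baseline V(s) and under the classical variance-minimizing baseline b* = E[||∇ log π(a)||² Q(s,a)] / E[||∇ log π(a)||²], which is the single-agent analogue of what "optimal baseline" means here.

```python
import numpy as np

# Minimal illustration of baseline-subtracted score-function policy gradients
# for one agent with a softmax policy over three actions at a single state.
# This is a simplified sketch of the "optimal baseline" idea, not the paper's
# multi-agent OB derivation; all numbers below are illustrative.

def pg_estimate(policy, q_values, action, baseline):
    """Gradient estimate w.r.t. the softmax logits for one sampled action."""
    grad_log_pi = -policy.copy()      # d log pi(a) / d logits = one_hot(a) - pi
    grad_log_pi[action] += 1.0
    return grad_log_pi * (q_values[action] - baseline)

def estimator_variance(policy, q_values, baseline, n_samples=50000, seed=0):
    """Empirical variance (trace of the covariance) of the gradient estimator."""
    rng = np.random.default_rng(seed)
    actions = rng.choice(len(policy), size=n_samples, p=policy)
    grads = np.stack([pg_estimate(policy, q_values, a, baseline) for a in actions])
    return grads.var(axis=0).sum()

policy = np.array([0.5, 0.3, 0.2])       # pi(a | s)
q_values = np.array([1.0, 3.0, -2.0])    # Q(s, a)

# Common choice: the state value V(s) = E_a[Q(s, a)] as the baseline.
v_baseline = np.dot(policy, q_values)

# Variance-minimizing baseline for this estimator: a Q-weighted average with
# weights pi(a) * ||grad log pi(a)||^2 (a standard result for score-function
# estimators, shown here only to illustrate the "optimal baseline" idea).
score_sq = np.array([np.sum((np.eye(len(policy))[a] - policy) ** 2)
                     for a in range(len(policy))])
weights = policy * score_sq
ob_baseline = np.dot(weights, q_values) / weights.sum()

print("Var with V(s) baseline   :", estimator_variance(policy, q_values, v_baseline))
print("Var with optimal baseline:", estimator_variance(policy, q_values, ob_baseline))
```

Because both calls share the same seed (and hence the same sampled actions), the comparison is paired; the optimal baseline yields a slightly smaller empirical variance in this toy example, and the choice of baseline becomes more consequential as more agents' action randomness enters the joint estimator, which is the regime the paper analyses.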
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Measuring Policy Distance for Multi-Agent Reinforcement Learning [9.80588687020087]
We propose the multi-agent policy distance (MAPD), a tool for measuring policy differences in multi-agent reinforcement learning (MARL).
By learning the conditional representations of agents' decisions, MAPD can compute the policy distance between any pair of agents.
We also extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects.
arXiv Detail & Related papers (2024-01-20T15:34:51Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL [154.13105285663656]
Cooperative Multi-Agent Reinforcement Learning (MARL) with permutation-invariant agents is a framework that has achieved tremendous empirical success in real-world applications.
Unfortunately, the theoretical understanding of this MARL problem is lacking due to the curse of many agents and the limited exploration of relational reasoning in existing works.
We prove that the suboptimality gaps of the model-free and model-based algorithms are independent of and logarithmic in the number of agents respectively, which mitigates the curse of many agents.
arXiv Detail & Related papers (2022-09-20T16:42:59Z)
- Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework [26.612749327414335]
Decentralized execution is a core requirement in cooperative multi-agent reinforcement learning (MARL).
In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods.
We show that TAD-PPO can theoretically perform optimal policy learning in finite multi-agent MDPs and that it significantly outperforms prior methods on a large set of cooperative multi-agent tasks.
arXiv Detail & Related papers (2022-07-12T06:59:13Z)
- Distributed Policy Gradient with Variance Reduction in Multi-Agent Reinforcement Learning [7.4447396913959185]
This paper studies a distributed policy gradient in collaborative multi-agent reinforcement learning (MARL).
Agents over a communication network aim to find the optimal policy to maximize the average of all agents' local returns.
arXiv Detail & Related papers (2021-11-25T08:07:30Z)
- Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- Off-Policy Multi-Agent Decomposed Policy Gradients [30.389041305278045]
We investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP).
DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment.
In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
arXiv Detail & Related papers (2020-07-24T02:21:55Z)
- The Effect of Multi-step Methods on Overestimation in Deep Reinforcement Learning [6.181642248900806]
Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways of performing multi-step expansion in order to reduce approximation error; a minimal sketch of n-step targets appears after this list.
arXiv Detail & Related papers (2020-06-23T01:35:54Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)
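As a side note on the multi-step entry above, the sketch below (with a made-up trajectory, value estimates, and hyperparameters) shows how n-step TD targets are formed by propagating n observed rewards before bootstrapping on the critic.

```python
import numpy as np

# Hypothetical n-step TD target computation for a single finished trajectory,
# illustrating what "multi-step (n-step) backup" means in the entry above.
# The rewards, value estimates, and hyperparameters are made up.

def n_step_targets(rewards, values, gamma, n):
    """y_t = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1} + gamma^k * V(s_{t+k}),
    with k = min(n, steps remaining); expects len(values) == len(rewards) + 1."""
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        k = min(n, T - t)                          # truncate at trajectory end
        discounts = gamma ** np.arange(k)
        targets[t] = np.dot(discounts, rewards[t:t + k]) + gamma ** k * values[t + k]
    return targets

rewards = np.array([0.0, 0.0, 1.0, 0.0, 5.0])
values  = np.array([0.1, 0.2, 0.9, 0.3, 4.0, 0.0])   # V(s_0..s_5); s_5 is terminal

print("1-step targets:", n_step_targets(rewards, values, gamma=0.99, n=1))
print("3-step targets:", n_step_targets(rewards, values, gamma=0.99, n=3))
```

Larger n relies less on possibly overestimated critic values and more on observed rewards, at the cost of higher target variance, which is the trade-off behind the overestimation results mentioned in that entry.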
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.