Order Matters: Agent-by-agent Policy Optimization
- URL: http://arxiv.org/abs/2302.06205v1
- Date: Mon, 13 Feb 2023 09:24:34 GMT
- Title: Order Matters: Agent-by-agent Policy Optimization
- Authors: Xihuai Wang, Zheng Tian, Ziyu Wan, Ying Wen, Jun Wang, Weinan Zhang
- Abstract summary: A sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance.
We propose the Agent-by-agent Policy Optimization (A2PO) algorithm to improve the sample efficiency.
- Score: 41.017093493743765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While multi-agent trust region algorithms have achieved great success
empirically in solving coordination tasks, most of them, however, suffer from a
non-stationarity problem since agents update their policies simultaneously. In
contrast, a sequential scheme that updates policies agent-by-agent provides
another perspective and shows strong performance. However, sample inefficiency
and lack of monotonic improvement guarantees for each agent are still the two
significant challenges for the sequential scheme. In this paper, we propose the
Agent-by-agent Policy Optimization (A2PO)
algorithm to improve the sample efficiency and retain the guarantees of
monotonic improvement for each agent during training. We justify the tightness
of the monotonic improvement bound compared with other trust region algorithms.
From the perspective of sequentially updating agents, we further consider the
effect of agent updating order and extend the theory of non-stationarity into
the sequential update scheme. To evaluate A2PO, we conduct a comprehensive
empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo,
Multi-agent Particle Environment, and Google Research Football full game
scenarios. A2PO consistently outperforms strong baselines.
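
The contrast between simultaneous and agent-by-agent updates, and the effect of the update order, can be illustrated with a toy sketch. The code below is not the A2PO algorithm: it uses plain deterministic best-response updates in a 2x2 cooperative matrix game, with no trust region and no learned policies. The reward matrix `R`, the helper names, and the starting joint action are all assumptions made up for this illustration.

```python
# Toy illustration only: deterministic best-response updates in a 2x2
# cooperative matrix game. This is NOT the A2PO algorithm.

# Shared reward R[a0][a1] for agent 0's action a0 and agent 1's action a1
# (hypothetical values chosen for illustration).
R = [[2.0, 0.0],
     [0.0, 1.0]]


def best_response(agent, joint):
    """Return the action maximizing the shared reward for `agent`,
    holding the other agent's action in `joint` fixed."""
    def reward_if(action):
        trial = list(joint)
        trial[agent] = action
        return R[trial[0]][trial[1]]
    return max(range(2), key=reward_if)


def simultaneous_step(joint):
    """Both agents best-respond to the OLD joint action at the same time."""
    return (best_response(0, joint), best_response(1, joint))


def sequential_step(joint, order):
    """Agents update one at a time; later agents see earlier updates."""
    joint = list(joint)
    for agent in order:
        joint[agent] = best_response(agent, joint)
    return tuple(joint)


def rollout(step, start, n_steps=4):
    """Apply `step` repeatedly and record the shared reward after each step."""
    joint, rewards = start, []
    for _ in range(n_steps):
        joint = step(joint)
        rewards.append(R[joint[0]][joint[1]])
    return joint, rewards


start = (0, 1)  # miscoordinated start with shared reward 0
sim_joint, sim_rewards = rollout(simultaneous_step, start)
seq01_joint, seq01_rewards = rollout(lambda j: sequential_step(j, (0, 1)), start)
seq10_joint, seq10_rewards = rollout(lambda j: sequential_step(j, (1, 0)), start)

print("simultaneous:", sim_joint, sim_rewards)
print("sequential, agent 0 first:", seq01_joint, seq01_rewards)
print("sequential, agent 1 first:", seq10_joint, seq10_rewards)
```

In this toy game the simultaneous scheme oscillates between the two miscoordinated joint actions and never earns reward, because each agent responds to a partner action that has already changed by the time the update lands. Both sequential orders coordinate after one sweep, but they settle on different joint actions with different rewards, which is a small-scale picture of why the updating order can matter.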
Related papers
- Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
(2023-05-08T16:20:03Z)
- Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
(2022-10-03T16:05:43Z)
- Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement
Learning with Actor Rectification [74.10976684469435]
One might expect that offline reinforcement learning (RL) algorithms can be transferred to multi-agent settings directly.
We propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), to tackle this critical challenge.
OMAR significantly outperforms strong baselines with state-of-the-art performance in multi-agent continuous control benchmarks.
(2021-11-22T13:27:42Z)
- Coordinated Proximal Policy Optimization [28.780862892562308]
Coordinated Proximal Policy Optimization (CoPPO) is an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective.
We then interpret this objective in CoPPO as achieving dynamic credit assignment among agents, thereby alleviating the high-variance issue during the concurrent update of agent policies.
(2021-11-07T11:14:19Z)
- Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning [25.027143431992755]
Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks.
Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply.
In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme.
Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (
(2021-09-23T09:44:35Z)
- Multi-agent Policy Optimization with Approximatively Synchronous
Advantage Estimation [55.96893934962757]
In multi-agent systems, policies of different agents need to be evaluated jointly.
In current methods, value functions or advantage functions use counter-factual joint actions which are evaluated asynchronously.
In this work, we propose the approximatively synchronous advantage estimation.
(2020-12-07T07:29:19Z)
- Multi-Agent Trust Region Policy Optimization [34.91180300856614]
We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases.
We propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO).
(2020-10-15T17:49:47Z)
- Multiagent Value Iteration Algorithms in Dynamic Programming and
Reinforcement Learning [0.0]
We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions.
In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order.
(2020-05-04T16:34:24Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
(2020-03-14T21:29:09Z)