Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2109.11251v1
- Date: Thu, 23 Sep 2021 09:44:35 GMT
- Title: Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
- Authors: Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun,
Jun Wang, Yaodong Yang
- Abstract summary: Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks.
Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply.
In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme.
Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms.
- Score: 25.027143431992755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trust region methods rigorously enabled reinforcement learning (RL) agents to
learn monotonically improving policies, leading to superior performance on a
variety of tasks. Unfortunately, when it comes to multi-agent reinforcement
learning (MARL), the property of monotonic improvement may not simply apply;
this is because agents, even in cooperative games, could have conflicting
directions of policy updates. As a result, achieving a guaranteed improvement
on the joint policy where each agent acts individually remains an open
challenge. In this paper, we extend the theory of trust region learning to
MARL. Central to our findings are the multi-agent advantage decomposition lemma
and the sequential policy update scheme. Based on these, we develop
Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and
Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike
many existing MARL algorithms, HATRPO/HAPPO do not need agents to share
parameters, nor do they need any restrictive assumptions on decomposability of
the joint value function. Most importantly, we justify in theory the monotonic
improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a
series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO
and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and
MADDPG on all tested tasks, thereby establishing a new state of the art.
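The two ingredients named in the abstract can be sketched concretely. First, the multi-agent advantage decomposition lemma; the LaTeX below is a reconstruction from the abstract's description (notation such as $\boldsymbol{\pi}$ for the joint policy and $i_{1:m}$ for an ordered agent subset is assumed here, so consult the paper for the precise statement):

```latex
% Multi-agent advantage decomposition (sketch): the joint advantage of an
% ordered subset of agents telescopes into per-agent advantages, each
% conditioned on the actions already chosen by the preceding agents.
\[
  A_{\boldsymbol{\pi}}^{i_{1:m}}\!\left(s,\, \boldsymbol{a}^{i_{1:m}}\right)
  \;=\; \sum_{j=1}^{m} A_{\boldsymbol{\pi}}^{i_j}\!\left(s,\, \boldsymbol{a}^{i_{1:j-1}},\, a^{i_j}\right)
\]
```

Second, the sequential policy update scheme. The self-contained Python sketch below illustrates the idea on a toy single-state game with tabular softmax policies: agents update one at a time in a random order, and each agent's clipped surrogate re-weights the advantage by the compounded probability ratio M of the agents updated before it. All names, hyperparameters, and the toy setup are illustrative assumptions, not the authors' implementation.

```python
# A minimal, self-contained sketch of the sequential update scheme on a toy
# single-state game. Everything here (shapes, hyperparameters, the
# REINFORCE-style gradient) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_samples, n_actions = 3, 256, 4
clip_eps, lr = 0.2, 0.05

def probs(logits):
    """Softmax over one agent's action logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Old joint policy, a sampled batch, and stand-in advantage estimates.
old_logits = rng.normal(size=(n_agents, n_actions))
new_logits = old_logits.copy()
actions = np.stack([rng.choice(n_actions, n_samples, p=probs(old_logits[i]))
                    for i in range(n_agents)])
adv = rng.normal(size=n_samples)

M = np.ones(n_samples)  # compounded ratio of the agents updated so far
for i in rng.permutation(n_agents):  # agents update sequentially
    onehot = np.eye(n_actions)[actions[i]]
    for _ in range(10):  # a few ascent steps on the clipped surrogate
        p_new, p_old = probs(new_logits[i]), probs(old_logits[i])
        ratio = p_new[actions[i]] / p_old[actions[i]]
        surr = np.minimum(ratio * M * adv,
                          np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * M * adv)
        # Gradient flows only through samples where the unclipped branch wins.
        active = surr >= ratio * M * adv
        grad = ((onehot - p_new) * (M * adv * ratio * active)[:, None]).mean(0)
        new_logits[i] += lr * grad
    # Fold agent i's final ratio into M before the next agent updates.
    M *= probs(new_logits[i])[actions[i]] / probs(old_logits[i])[actions[i]]
```

Because each agent optimizes against ratios its predecessors have already committed to, conflicting update directions are resolved in sequence, which is the intuition behind the monotonic improvement guarantee.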
Related papers
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Robust Multi-Agent Reinforcement Learning via Adversarial Regularization: Theoretical Foundation and Stable Algorithms [79.61176746380718]
Multi-Agent Reinforcement Learning (MARL) has shown promising results across several domains.
MARL policies often lack robustness and are sensitive to small changes in their environment.
We show that we can gain robustness by controlling a policy's Lipschitz constant.
We propose a new robust MARL framework, ERNIE, that promotes the Lipschitz continuity of the policies; a sketch of this regularization idea appears after this list.
arXiv Detail & Related papers (2023-10-16T20:14:06Z)
- Deep Multi-Agent Reinforcement Learning for Decentralized Active Hypothesis Testing [11.639503711252663]
We tackle the multi-agent active hypothesis testing (AHT) problem by introducing a novel algorithm rooted in the framework of deep multi-agent reinforcement learning.
We present a comprehensive set of experimental results that effectively showcase the agents' ability to learn collaborative strategies and enhance performance.
arXiv Detail & Related papers (2023-09-14T01:18:04Z)
- Heterogeneous Multi-Agent Reinforcement Learning via Mirror Descent Policy Optimization [1.5501208213584152]
This paper presents an extension of the Mirror Descent method to overcome challenges in cooperative Multi-Agent Reinforcement Learning (MARL) settings.
The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm utilizes the multi-agent advantage decomposition lemma.
We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraft II tasks, demonstrating its superiority over state-of-the-art algorithms.
arXiv Detail & Related papers (2023-08-13T10:18:10Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Order Matters: Agent-by-agent Policy Optimization [41.017093493743765]
A sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance.
We propose the Agent-by-agent Policy Optimization (A2PO) algorithm to improve sample efficiency.
arXiv Detail & Related papers (2023-02-13T09:24:34Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training; a sketch of this ratio-bounding idea appears after this list.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Multi-Agent Trust Region Policy Optimization [34.91180300856614]
We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases.
We propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO).
arXiv Detail & Related papers (2020-10-15T17:49:47Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC).
It is a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)
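For the ERNIE entry above, here is a hedged illustration of Lipschitz-style regularization: estimate how far a policy's action distribution moves under a small state perturbation and penalize that movement. The random-probing scheme, names, and toy policy are assumptions for illustration; the paper's actual adversarial inner optimization differs.

```python
# Hypothetical sketch (not ERNIE's implementation) of penalizing the change
# in a policy's action distribution under small state perturbations, which
# in effect controls the policy's Lipschitz constant.
import numpy as np

def lipschitz_penalty(policy, state, radius=0.05, n_probes=16, seed=0):
    """Crude estimate of max_{||d|| <= radius} ||policy(s+d) - policy(s)|| / radius
    via random probing; a real adversarial step would optimize d instead."""
    rng = np.random.default_rng(seed)
    base = policy(state)
    worst = 0.0
    for _ in range(n_probes):
        d = rng.normal(size=state.shape)
        d *= radius / np.linalg.norm(d)  # project onto the perturbation ball
        worst = max(worst, np.linalg.norm(policy(state + d) - base) / radius)
    return worst  # add lam * worst to the policy loss to encourage smoothness

# Toy softmax policy over 3 actions for a 2-dimensional state.
W = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])
toy_policy = lambda s: np.exp(W @ s) / np.exp(W @ s).sum()
print(lipschitz_penalty(toy_policy, np.array([0.3, -0.1])))
```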
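For the decentralized-PPO entry above, a minimal sketch of bounding independent ratios based on the number of agents: if each of n agents keeps its own ratio within roughly 1 ± ε/n, the joint (product) ratio stays near 1 ± ε to first order. The exact bound and its enforcement in the paper differ; this only illustrates the scaling.

```python
# Hypothetical illustration (not the paper's algorithm) of shrinking each
# agent's independent clip range with the number of agents so that the
# product of per-agent ratios stays inside an overall joint trust region.
import numpy as np

def clip_independent_ratios(ratios, eps=0.2):
    """ratios: (n_agents, n_samples) array of per-agent pi_new/pi_old ratios."""
    n_agents = ratios.shape[0]
    per_agent_eps = eps / n_agents  # tighter per-agent bound
    return np.clip(ratios, 1.0 - per_agent_eps, 1.0 + per_agent_eps)

# Three agents' ratios on two joint samples; the clipped joint ratio stays
# close to the 1 +/- eps band even though the raw ratios wander further.
raw = np.array([[1.3, 0.8],
                [1.1, 0.7],
                [0.9, 1.4]])
joint = clip_independent_ratios(raw).prod(axis=0)
print(joint)  # approximately within [0.8, 1.2] (first-order bound)
```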