Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via
Trust Region Decomposition
- URL: http://arxiv.org/abs/2102.10616v1
- Date: Sun, 21 Feb 2021 14:46:50 GMT
- Title: Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via
Trust Region Decomposition
- Authors: Wenhao Li, Xiangfeng Wang, Bo Jin, Junjie Sheng, Hongyuan Zha
- Abstract summary: Non-stationarity is one thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
- Score: 52.06086375833474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-stationarity is one thorny issue in multi-agent reinforcement learning,
which is caused by the policy changes of agents during the learning procedure.
Existing approaches to this problem, such as centralized critic with
decentralized actors (CCDA), population-based self-play, and modeling of other
agents, have limitations in effectiveness and scalability. In this paper, we
introduce a $\delta$-stationarity measurement to explicitly model the
stationarity of a policy sequence, which is theoretically proved to be
proportional to the joint policy divergence. However, a simple policy
factorization such as the mean-field approximation can mislead the optimization
toward a larger policy divergence, which can be regarded as a trust region
decomposition dilemma. We
model the joint policy as a general Markov random field and propose a trust
region decomposition network based on message passing to estimate the joint
policy divergence more accurately. The Multi-Agent Mirror descent policy
algorithm with Trust region decomposition (MAMT) is then established to
satisfy $\delta$-stationarity. MAMT adaptively adjusts the trust regions of
the local policies in an end-to-end manner, thereby approximately constraining
the divergence of the joint policy and alleviating the non-stationarity
problem. Our method brings noticeable and stable performance improvements over
baselines in coordination tasks of varying complexity.
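To make the $\delta$-stationarity idea concrete, here is a minimal sketch of the adaptive trust-region adjustment it implies: per-agent KL bounds are shrunk whenever an estimate of the joint policy divergence exceeds the budget $\delta$. The proportional-scaling coordinator and the sum-of-KLs surrogate are illustrative simplifications; the paper instead estimates the joint divergence with a message-passing trust region decomposition network, which is not reproduced here.

```python
# Minimal sketch: keep an estimate of the joint policy divergence below a
# budget delta by adaptively shrinking the local (per-agent) trust regions.
import numpy as np

def adjust_local_trust_regions(local_kls, local_bounds, delta):
    """Shrink per-agent KL bounds when the estimated joint divergence exceeds delta.

    local_kls    -- measured KL(pi_i_new || pi_i_old) for each agent (hypothetical inputs)
    local_bounds -- current per-agent trust-region sizes epsilon_i
    delta        -- target stationarity budget for the joint policy
    """
    local_kls = np.asarray(local_kls, dtype=float)
    local_bounds = np.asarray(local_bounds, dtype=float)

    # Crude surrogate for the joint divergence: sum of local divergences.
    # (The paper argues such naive factorizations are misleading, hence the
    # learned message-passing decomposition network.)
    joint_div_estimate = local_kls.sum()

    if joint_div_estimate <= delta:
        return local_bounds                      # already delta-stationary
    # Otherwise scale every local bound down proportionally.
    scale = delta / (joint_div_estimate + 1e-8)
    return local_bounds * scale

# Example: three agents, one of which moved its policy too far this iteration.
print(adjust_local_trust_regions([0.02, 0.08, 0.01], [0.05, 0.05, 0.05], delta=0.06))
```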
Related papers
- Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued, and interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the mechanism generating this phenomenon, and we employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
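As an illustration of the multimodal policy output described above, the sketch below builds a Gaussian-mixture policy head in PyTorch. Layer sizes, the number of components, and all names are illustrative assumptions rather than MUPO's actual architecture.

```python
# A mixture-of-Gaussians policy head: it can represent discontinuous or
# multi-valued feasible policies that a single Gaussian cannot.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GaussianMixturePolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_components=3, hidden=64):
        super().__init__()
        self.n_components, self.act_dim = n_components, act_dim
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_components)              # mixture weights
        self.mu = nn.Linear(hidden, n_components * act_dim)        # component means
        self.log_std = nn.Parameter(torch.zeros(n_components, act_dim))

    def forward(self, obs):
        h = self.body(obs)
        mix = Categorical(logits=self.logits(h))
        mu = self.mu(h).view(-1, self.n_components, self.act_dim)
        comp = Independent(Normal(mu, self.log_std.exp()), 1)
        return MixtureSameFamily(mix, comp)

# Usage: sample an action and evaluate its log-probability.
policy = GaussianMixturePolicy(obs_dim=8, act_dim=2)
dist = policy(torch.randn(4, 8))
a = dist.sample()
print(a.shape, dist.log_prob(a).shape)
```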
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - AgentMixer: Multi-Agent Correlated Policy Factorization [39.041191852287525]
We introduce strategy modification to provide a mechanism for agents to correlate their policies.
We present a novel framework, AgentMixer, which constructs the joint fully observable policy as a non-linear combination of individual partially observable policies.
We show that AgentMixer converges to an $\epsilon$-approximate Correlated Equilibrium.
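The sketch below illustrates one plausible reading of the construction: decentralized per-agent policies computed from local observations, combined by a centralized non-linear mixer that sees the global state and outputs correlated joint-action logits. The concrete architecture is an assumption for illustration, not the paper's design.

```python
# Rough sketch: compose individual partially observable policies into one fully
# observable, correlated joint policy via a learned non-linear mixer.
import torch
import torch.nn as nn

class AgentMixerSketch(nn.Module):
    def __init__(self, n_agents, obs_dim, state_dim, n_actions, hidden=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        # One decentralized policy per agent, acting on its local observation.
        self.agents = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
            for _ in range(n_agents)
        ])
        # Centralized mixer: sees the global state plus all per-agent logits and
        # outputs logits over the joint action space (a correlated policy).
        self.mixer = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions ** n_agents),
        )

    def forward(self, local_obs, state):
        # local_obs: (batch, n_agents, obs_dim); state: (batch, state_dim)
        per_agent = torch.cat([p(local_obs[:, i]) for i, p in enumerate(self.agents)], dim=-1)
        return self.mixer(torch.cat([state, per_agent], dim=-1))   # joint-action logits

net = AgentMixerSketch(n_agents=2, obs_dim=6, state_dim=12, n_actions=4)
print(net(torch.randn(3, 2, 6), torch.randn(3, 12)).shape)  # (3, 16) joint logits
```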
arXiv Detail & Related papers (2024-01-16T15:32:41Z) - Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
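A small sketch of the dispersion-matrix step: given policy embeddings (produced by a transformer over state-action histories in the paper), stack them into a pairwise-similarity matrix and score how diverse the policy set is. The RBF kernel and log-determinant score are illustrative choices, not necessarily the paper's exact objective.

```python
# Score the diversity of a set of policies from their embeddings via a
# pairwise-similarity "dispersion matrix".
import numpy as np

def dispersion_score(policy_embeddings, bandwidth=1.0):
    """policy_embeddings: (n_policies, embed_dim) array of policy representations."""
    z = np.asarray(policy_embeddings, dtype=float)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    dispersion_matrix = np.exp(-sq_dists / (2.0 * bandwidth ** 2))   # similarity kernel
    # Higher log-det <=> embeddings (and hence policies) are more spread out.
    sign, logdet = np.linalg.slogdet(dispersion_matrix + 1e-6 * np.eye(len(z)))
    return logdet

rng = np.random.default_rng(0)
print(dispersion_score(rng.normal(size=(4, 16))))   # larger for more diverse policies
```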
arXiv Detail & Related papers (2023-02-28T11:58:39Z) - Monotonic Improvement Guarantees under Non-stationarity for
Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
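A hedged sketch of what an agent-count-aware trust region can look like in a decentralized PPO update: each agent clips its own independent importance ratio, with the clipping range tightened as the number of agents grows. The 1/n scaling below is an illustrative stand-in for the paper's principled bound.

```python
# Per-agent PPO surrogate with a clipping range that shrinks with team size, so
# the product of independent ratios stays bounded.
import torch

def dec_ppo_loss(logp_new, logp_old, advantages, n_agents, base_clip=0.2):
    eps = base_clip / n_agents                        # tighter range for more agents
    ratio = torch.exp(logp_new - logp_old)            # independent (per-agent) ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example with dummy batch data for one agent in a 4-agent team.
logp_new = torch.randn(32, requires_grad=True)
loss = dec_ppo_loss(logp_new, torch.randn(32), torch.randn(32), n_agents=4)
loss.backward()
print(float(loss))
```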
arXiv Detail & Related papers (2022-01-31T20:39:48Z) - Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning [25.027143431992755]
Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks.
Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply.
In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme.
Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO).
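The sequential policy update scheme can be sketched as follows: agents update one at a time in a random order, and each agent's advantage is weighted by the compounded importance ratios of the agents that have already updated, following the multi-agent advantage decomposition. The ToyAgent class is a hypothetical stand-in for a real trust-region learner, not the paper's implementation.

```python
# Sequential update: agents take their trust-region steps one after another,
# folding predecessors' importance ratios into the advantage.
import math
import random

class ToyAgent:
    """Placeholder for a single-agent TRPO/PPO learner."""
    def __init__(self):
        self.old_logp = -1.0          # log pi_old(a|s) on the batch (dummy scalar)
        self.new_logp = -1.0          # log pi_new(a|s) after the update

    def update(self, weighted_advantage):
        # A real agent would take a trust-region step on the weighted advantage;
        # here we just nudge the dummy log-probability.
        self.new_logp = self.old_logp + 0.1 * weighted_advantage

    def ratio(self):
        return math.exp(self.new_logp - self.old_logp)

def sequential_update(agents, joint_advantage):
    order = random.sample(range(len(agents)), len(agents))   # random permutation
    compounded = 1.0
    for i in order:
        # Predecessors' ratios weight this agent's advantage.
        agents[i].update(compounded * joint_advantage)
        compounded *= agents[i].ratio()
    return compounded

team = [ToyAgent() for _ in range(3)]
print(sequential_update(team, joint_advantage=0.5))
```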
arXiv Detail & Related papers (2021-09-23T09:44:35Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
The implicit distributional actor-critic (IDAC) is built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
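For intuition, here is a minimal sketch of a semi-implicit actor: auxiliary noise is concatenated with the state, and the marginal of the output Gaussian over that noise yields a flexible, non-Gaussian policy. Layer sizes are assumptions, and the DGN-based distributional critic is omitted.

```python
# Semi-implicit actor: state + auxiliary noise -> Gaussian parameters; the
# noise-marginal of that Gaussian is a flexible policy distribution.
import torch
import torch.nn as nn

class SemiImplicitActor(nn.Module):
    def __init__(self, obs_dim, act_dim, noise_dim=5, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),           # mean and log-std
        )

    def forward(self, obs):
        eps = torch.randn(obs.shape[0], self.noise_dim)    # auxiliary noise
        mean, log_std = self.net(torch.cat([obs, eps], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

actor = SemiImplicitActor(obs_dim=11, act_dim=3)
action = actor(torch.randn(8, 11)).rsample()               # reparameterized sample
print(action.shape)                                        # (8, 3)
```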
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
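A rough sketch of embedding inference in the policy: treat the joint action distribution as a pairwise Markov random field over agents and run a few differentiable mean-field updates to obtain per-agent action marginals. The potentials and the fixed iteration count are illustrative assumptions, not the exact VPP layers.

```python
# Differentiable mean-field inference over a pairwise MRF of agent actions.
import torch

def mean_field_marginals(unary_logits, pairwise, n_iters=5):
    """unary_logits: (n_agents, n_actions); pairwise: (n_agents, n_agents, n_actions, n_actions)."""
    n_agents = unary_logits.shape[0]
    mask = 1.0 - torch.eye(n_agents)                   # exclude self-messages
    q = torch.softmax(unary_logits, dim=-1)            # initial per-agent marginals
    for _ in range(n_iters):                           # differentiable mean-field updates
        # Message into agent i: expected pairwise potential under neighbours' marginals.
        msg = torch.einsum('jb,ijab,ij->ia', q, pairwise, mask)
        q = torch.softmax(unary_logits + msg, dim=-1)
    return q

n_agents, n_actions = 3, 4
unary = torch.randn(n_agents, n_actions, requires_grad=True)
pairwise = torch.randn(n_agents, n_agents, n_actions, n_actions)
marginals = mean_field_marginals(unary, pairwise)
marginals[:, 0].sum().backward()                       # the inference layers stay differentiable
print(marginals)
```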
arXiv Detail & Related papers (2020-04-19T15:42:55Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
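As a simplified illustration, the sketch below augments a clipped surrogate with a proximity penalty between consecutive policies, approximated by an action-level KL estimated on data from the previous policy. The paper's actual term constrains the discounted state-action visitation distributions themselves, which this simplification does not capture.

```python
# Clipped surrogate plus a divergence-regularization term for stability.
import torch

def regularized_surrogate(logp_new, logp_old, advantages, beta=1.0, clip=0.2):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    # Monte-Carlo estimate of KL(pi_old || pi_new) on samples drawn from pi_old.
    kl_penalty = (logp_old - logp_new).mean()
    return -(surrogate - beta * kl_penalty)

logp_new = torch.randn(64, requires_grad=True)
loss = regularized_surrogate(logp_new, torch.randn(64), torch.randn(64))
loss.backward()
print(float(loss))
```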
arXiv Detail & Related papers (2020-03-09T13:05:47Z)