Variational Policy Propagation for Multi-agent Reinforcement Learning
- URL: http://arxiv.org/abs/2004.08883v4
- Date: Sat, 29 Jan 2022 11:08:12 GMT
- Title: Variational Policy Propagation for Multi-agent Reinforcement Learning
- Authors: Chao Qu, Hui Li, Chang Liu, Junwu Xiong, James Zhang, Wei Chu,
Weiqiang Wang, Yuan Qi, Le Song
- Abstract summary: We propose a emphcollaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a emphjoint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate the variational inference as special differentiable layers in policy such as the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
- Score: 68.26579560607597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a \emph{collaborative} multi-agent reinforcement learning
algorithm named variational policy propagation (VPP) to learn a \emph{joint}
policy through the interactions over agents. We prove that the joint policy is
a Markov Random Field under some mild conditions, which in turn reduces the
policy space effectively. We integrate the variational inference as special
differentiable layers in policy such that the actions can be efficiently
sampled from the Markov Random Field and the overall policy is differentiable.
We evaluate our algorithm on several large scale challenging tasks and
demonstrate that it outperforms previous state-of-the-arts.
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper tries to learn the diverse policies from the history of state-action pairs under a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy
Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via
Trust Region Decomposition [52.06086375833474]
Non-stationarity is one thorny issue in multi-agent reinforcement learning.
We introduce a $delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.