Model-Based Decentralized Policy Optimization
- URL: http://arxiv.org/abs/2302.08139v1
- Date: Thu, 16 Feb 2023 08:15:18 GMT
- Title: Model-Based Decentralized Policy Optimization
- Authors: Hao Luo, Jiechuan Jiang, and Zongqing Lu
- Abstract summary: Decentralized policy optimization has been commonly used in cooperative multi-agent tasks.
We propose model-based decentralized policy optimization (MDPO).
We theoretically show that policy optimization in MDPO is more stable than model-free decentralized policy optimization.
- Score: 27.745312627153012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized policy optimization has been commonly used in cooperative
multi-agent tasks. However, since all agents update their policies
simultaneously, the environment is non-stationary from the perspective of each
individual agent, making it hard to guarantee monotonic policy improvement. To
make policy improvement stable and monotonic, we propose model-based
decentralized policy optimization (MDPO), which incorporates a latent variable
function to help construct the transition and reward functions from an
individual perspective. We theoretically show that policy optimization in MDPO
is more stable than model-free decentralized policy optimization. Moreover, due
to non-stationarity, the latent variable function varies over time and is hard
to model. We further propose a latent variable prediction method that reduces
the error of the latent variable function and theoretically contributes to
monotonic policy improvement. Empirically, MDPO indeed outperforms model-free
decentralized policy optimization in a variety of cooperative multi-agent
tasks.
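The abstract's core mechanism can be illustrated with a toy sketch: a per-agent learned model conditions the transition on a latent variable z that summarizes the other agents' (non-stationary) influence, a slowly drifting predictor estimates the next z, and candidate policies are compared by rolling the model forward. All functions and coefficients below are hypothetical linear stand-ins for the paper's learned networks, not the actual MDPO implementation.

```python
import numpy as np

# Hypothetical linear models standing in for learned networks. The latent z
# summarizes other agents' influence so the per-agent dynamics s' = f(s, a, z)
# look stationary from the individual perspective.
W_S, W_A, W_Z = 0.9, 0.5, 0.3

def transition_model(s, a, z):
    """Individual-perspective transition conditioned on the latent z."""
    return W_S * s + W_A * a + W_Z * z

def reward_model(s, a):
    """Toy quadratic cost: drive the state to zero with small actions."""
    return -(s ** 2) - 0.1 * a ** 2

def predict_latent(z_prev):
    """Latent variable prediction: assume z drifts slowly, so a damped copy
    of the previous latent is a low-error estimate of the next one."""
    return 0.95 * z_prev

def model_rollout(s0, policy, z0, horizon=10):
    """Estimate a policy's return by rolling the learned model forward."""
    s, z, ret = s0, z0, 0.0
    for _ in range(horizon):
        a = policy(s)
        ret += reward_model(s, a)
        s = transition_model(s, a, z)
        z = predict_latent(z)
    return ret

# Policy optimization can now compare candidates inside the model,
# without further interaction with the non-stationary real environment.
ret_damping = model_rollout(1.0, lambda s: -0.5 * s, z0=0.2)
ret_passive = model_rollout(1.0, lambda s: 0.0, z0=0.2)
print(ret_damping > ret_passive)  # the damping policy scores higher
```

The point of the sketch is the division of labor: the latent predictor absorbs the non-stationarity, so the rollout itself behaves like planning in a stationary model.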
Related papers
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
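The "updated similarly to vanilla PPO" part refers to the standard clipped surrogate objective, applied by each agent to its own local policy. A minimal NumPy version of that objective (the function name and toy inputs are illustrative, not from the paper):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Vanilla PPO clipped surrogate, per agent and per sample.
    ratio = pi_new(a|s) / pi_old(a|s); advantage is any estimator."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps], which is what keeps updates local.
    return np.minimum(unclipped, clipped)

# A ratio above 1 + eps earns no extra credit for a positive advantage:
# the surrogate is capped at (1 + eps) * A.
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))
```

In the multi-agent setting each agent evaluates this objective with its own ratio and a shared or local advantage estimate; the paper's contribution is showing that these purely local updates still converge to the globally optimal joint policy under regularity conditions.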
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
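The "perturbed distribution" in RPO amounts to adding uniform noise to the mean of the Gaussian action distribution before sampling, which keeps the policy from collapsing to near-deterministic behavior. A hedged sketch of that sampling step (function name and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rpo_sample_action(mean, std, alpha=0.5):
    """RPO-style sampling sketch: perturb the Gaussian mean with uniform
    noise in [-alpha, alpha], then draw the action from the perturbed
    Gaussian. The extra variance sustains exploration throughout training."""
    perturbed_mean = mean + rng.uniform(-alpha, alpha, size=np.shape(mean))
    return rng.normal(perturbed_mean, std)

# The effective action spread is wider than the base Gaussian's std alone
# (variance adds: std^2 + alpha^2 / 3).
actions = [rpo_sample_action(0.0, 0.1) for _ in range(1000)]
print(np.std(actions) > 0.1)
```

The design choice is that the perturbation scale alpha is fixed rather than learned, so exploration does not shrink even when the learned std does.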
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Decentralized Policy Optimization [21.59254848913971]
We propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantees.
Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments.
arXiv Detail & Related papers (2022-11-06T05:38:23Z) - Towards Global Optimality in Cooperative MARL with the Transformation
And Distillation Framework [26.612749327414335]
Decentralized execution is one core demand in cooperative multi-agent reinforcement learning (MARL).
In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods.
We show that TAD-PPO can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms existing methods on a large set of cooperative multi-agent tasks.
arXiv Detail & Related papers (2022-07-12T06:59:13Z) - Coordinated Proximal Policy Optimization [28.780862892562308]
Coordinated Proximal Policy Optimization (CoPPO) is an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective.
We then show that such an objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high-variance issue during concurrent updates of agent policies.
arXiv Detail & Related papers (2021-11-07T11:14:19Z) - Permutation Invariant Policy Optimization for Mean-Field Multi-Agent
Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
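A standard way to obtain the permutation invariance that MF-PPO's actor-critic architecture relies on is to pool over the agent dimension, e.g. with a mean, so that reordering agents cannot change the output. The sketch below shows only that pooling idea, not the paper's actual network:

```python
import numpy as np

def permutation_invariant_embed(agent_obs):
    """Mean-pool over the agent axis: any permutation of the rows
    (agents) yields exactly the same embedding, which is the inductive
    bias a mean-field actor-critic exploits."""
    return np.mean(agent_obs, axis=0)

obs = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])   # 3 agents, 2 features each
shuffled = obs[[2, 0, 1]]      # same agents, different order

print(np.allclose(permutation_invariant_embed(obs),
                  permutation_invariant_embed(shuffled)))
```

Because the embedding depends only on the set of agent observations, the number of effective inputs no longer grows with the ordering of agents, which is what makes the mean-field treatment tractable.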
arXiv Detail & Related papers (2021-05-18T04:35:41Z) - Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via
Trust Region Decomposition [52.06086375833474]
Non-stationarity is a thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z) - Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.