Monotonic Improvement Guarantees under Non-stationarity for
Decentralized PPO
- URL: http://arxiv.org/abs/2202.00082v1
- Date: Mon, 31 Jan 2022 20:39:48 GMT
- Title: Monotonic Improvement Guarantees under Non-stationarity for
Decentralized PPO
- Authors: Mingfei Sun, Sam Devlin, Katja Hofmann, Shimon Whiteson
- Abstract summary: We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
- Score: 66.5384483339413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new monotonic improvement guarantee for optimizing decentralized
policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds
even when the transition dynamics are non-stationary. This new analysis
provides a theoretical understanding of the strong performance of two recent
actor-critic methods for MARL, i.e., Independent Proximal Policy Optimization
(IPPO) and Multi-Agent PPO (MAPPO), which both rely on independent ratios,
i.e., computing probability ratios separately for each agent's policy. We show
that, despite the non-stationarity that independent ratios cause, a monotonic
improvement guarantee still arises as a result of enforcing the trust region
constraint over all decentralized policies. We also show this trust region
constraint can be effectively enforced in a principled way by bounding
independent ratios based on the number of agents in training, providing a
theoretical foundation for proximal ratio clipping. Moreover, we show that the
surrogate objectives optimized in IPPO and MAPPO are essentially equivalent
when their critics converge to a fixed point. Finally, our empirical results
support the hypothesis that the strong performance of IPPO and MAPPO is a
direct result of enforcing such a trust region constraint via clipping in
centralized training, and the good values of the hyperparameters for this
enforcement are highly sensitive to the number of agents, as predicted by our
theoretical analysis.
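The following is a minimal, illustrative sketch (not the authors' code) of the mechanism the abstract describes: each agent computes its own independent probability ratio, and the trust region is enforced by clipping every ratio within a range that tightens as the number of agents grows. The function names and the exact scaling of the clip range with the number of agents are assumptions for illustration only.

```python
# Sketch of an IPPO/MAPPO-style surrogate with independent per-agent ratios
# and agent-count-dependent proximal clipping. Names and the eps/n_agents
# scaling are illustrative assumptions, not the paper's exact prescription.
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, n_agents, base_eps=0.2):
    """logp_new, logp_old: (n_agents, batch) log-probs of sampled actions
    under the new and old decentralized policies.
    advantages: (batch,) joint advantage estimates (e.g., from a shared critic)."""
    eps = base_eps / n_agents                     # tighter clip range with more agents (assumed scaling)
    ratios = np.exp(logp_new - logp_old)          # independent ratios, one per agent
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    # Pessimistic (min) surrogate, averaged over agents and samples.
    return np.minimum(ratios * advantages, clipped * advantages).mean()

# Example: 4 agents, batch of 8 transitions.
rng = np.random.default_rng(0)
lp_old = rng.normal(-1.0, 0.1, size=(4, 8))
lp_new = lp_old + rng.normal(0.0, 0.05, size=(4, 8))
adv = rng.normal(size=8)
print(clipped_surrogate(lp_new, lp_old, adv, n_agents=4))
```

In this reading, the hyperparameter sensitivity reported in the abstract corresponds to the clip range: a value that enforces the trust region well for few agents becomes too loose (or too tight) as the number of agents changes.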
Related papers
- Off-Policy Evaluation in Markov Decision Processes under Weak
Distributional Overlap [5.0401589279256065]
We revisit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
arXiv Detail & Related papers (2024-02-13T03:55:56Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Decentralized Policy Optimization [21.59254848913971]
We propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantees.
Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments.
arXiv Detail & Related papers (2022-11-06T05:38:23Z)
- Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning [25.027143431992755]
Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks.
Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply.
In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme.
Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO).
arXiv Detail & Related papers (2021-09-23T09:44:35Z)
- Dealing with Non-Stationarity in Multi-Agent Reinforcement Learning via Trust Region Decomposition [52.06086375833474]
Non-stationarity is one thorny issue in multi-agent reinforcement learning.
We introduce a $\delta$-stationarity measurement to explicitly model the stationarity of a policy sequence.
We propose a trust region decomposition network based on message passing to estimate the joint policy divergence.
arXiv Detail & Related papers (2021-02-21T14:46:50Z)
- Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)