Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games
- URL: http://arxiv.org/abs/2210.01050v2
- Date: Tue, 4 Oct 2022 01:07:33 GMT
- Title: Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games
- Authors: Shicong Cen, Yuejie Chi, Simon S. Du, Lin Xiao
- Abstract summary: This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
- Score: 63.60117916422867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to
interact in a shared dynamic environment -- permeates across a wide range of
critical applications. While there has been substantial progress on
understanding the global convergence of policy optimization methods in
single-agent RL, the design and analysis of efficient policy optimization
algorithms in the MARL setting present significant challenges which,
unfortunately, remain highly inadequately addressed by existing theory. In this
paper, we focus on the most basic setting of competitive multi-agent RL, namely
two-player zero-sum Markov games, and study equilibrium finding algorithms in
both the infinite-horizon discounted setting and the finite-horizon episodic
setting. We propose a single-loop policy optimization method with symmetric
updates from both agents, where the policy is updated via the
entropy-regularized optimistic multiplicative weights update (OMWU) method and
the value is updated on a slower timescale. We show that, in the
full-information tabular setting, the proposed method achieves a finite-time
last-iterate linear convergence to the quantal response equilibrium of the
regularized problem, which translates to a sublinear last-iterate convergence
to the Nash equilibrium by controlling the amount of regularization. Our
convergence results improve upon the best known iteration complexities, and
lead to a better understanding of policy optimization in competitive Markov
games.
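For intuition, here is a minimal sketch (not the paper's exact algorithm) of an entropy-regularized OMWU step for a single zero-sum matrix game $\max_{\mu}\min_{\nu}\mu^\top A\nu$; in the Markov-game method the payoff matrix is replaced by per-state regularized Q-values maintained on a slower timescale, and the function and variable names below are illustrative.

```python
import numpy as np

def omwu_step(A, mu, nu, mu_bar, nu_bar, eta, tau):
    """One symmetric entropy-regularized OMWU step (illustrative sketch).

    A:              payoff matrix of the game max_mu min_nu mu^T A nu
    mu, nu:         current policies of the max- and min-player
    mu_bar, nu_bar: previous prediction ("optimistic") policies
    eta, tau:       step size and entropy-regularization strength
    """
    def normalize(x):
        return x / x.sum()

    # Prediction step: each player reacts to the opponent's previous prediction.
    mu_bar_new = normalize(mu ** (1 - eta * tau) * np.exp( eta * (A   @ nu_bar)))
    nu_bar_new = normalize(nu ** (1 - eta * tau) * np.exp(-eta * (A.T @ mu_bar)))

    # Update step: each player reacts to the opponent's new prediction.
    mu_new = normalize(mu ** (1 - eta * tau) * np.exp( eta * (A   @ nu_bar_new)))
    nu_new = normalize(nu ** (1 - eta * tau) * np.exp(-eta * (A.T @ mu_bar_new)))
    return mu_new, nu_new, mu_bar_new, nu_bar_new
```

With $\tau > 0$ the iterates approach the quantal response equilibrium of the regularized game at a linear rate; choosing $\tau$ small then controls the distance to the Nash equilibrium of the unregularized game, which is where the sublinear last-iterate rate described in the abstract comes from.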
Related papers
- Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
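Since the local updates follow vanilla PPO, each agent's objective is the standard clipped surrogate (the paper's advantage estimator and regularity conditions are not reproduced here); the sketch below uses a clipping parameter of 0.2, a common default rather than a value from the paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one agent's local policy (sketch).

    ratio:     pi_new(a_i|s) / pi_old(a_i|s) for the agent's own action
    advantage: advantage estimates computed under the old joint policy
    eps:       clipping parameter (illustrative default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Maximize the pessimistic (clipped) objective, as in vanilla PPO.
    return np.minimum(unclipped, clipped).mean()
```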
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic
Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z) - Coordinated Proximal Policy Optimization [28.780862892562308]
Coordinated Proximal Policy Optimization (CoPPO) is an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective.
We then interpret this objective as performing dynamic credit assignment among agents, which alleviates the high-variance issue during concurrent updates of the agent policies.
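As a loose illustration only (CoPPO's actual objective and clipping scheme differ in detail), a joint surrogate can be formed from the product of per-agent probability ratios, so that each agent's update is coordinated with the others':

```python
import numpy as np

def joint_clipped_surrogate(ratios, advantage, eps=0.2):
    """Hypothetical joint surrogate used only to illustrate the idea of a
    coordinated (joint) objective; not CoPPO's exact formulation.

    ratios:    array of shape (n_agents, batch) of per-agent probability ratios
    advantage: array of shape (batch,) of joint advantage estimates
    """
    joint_ratio = np.prod(ratios, axis=0)                  # joint policy ratio
    unclipped = joint_ratio * advantage
    clipped = np.clip(joint_ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```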
arXiv Detail & Related papers (2021-11-07T11:14:19Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
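The AUPRC surrogate loss and its estimator are beyond this summary, but the momentum mechanism itself is a moving average of stochastic gradients; a generic sketch follows (parameter names and defaults are illustrative, not from the paper).

```python
import numpy as np

def momentum_step(w, m, stochastic_grad, eta=0.01, beta=0.9):
    """One moving-average (heavy-ball style) stochastic gradient step (sketch).

    w: parameters, m: momentum buffer, stochastic_grad: minibatch gradient
    of the AUPRC surrogate loss (not implemented here).
    """
    m = beta * m + (1.0 - beta) * stochastic_grad   # moving average of gradients
    w = w - eta * m                                 # descent step
    return w, m
```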
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Fast Policy Extragradient Methods for Competitive Games with Entropy
Regularization [40.21627891283402]
This paper investigates the problem of computing the equilibrium of competitive games.
Motivated by the algorithmic role of entropy regularization, we develop provably efficient extragradient methods.
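Concretely, entropy regularization turns the matrix-game equilibrium problem into a strongly convex-concave saddle-point problem to which extragradient-type updates apply; a sketch of the regularized objective, with $\mathcal{H}$ the Shannon entropy and $\tau>0$ the regularization strength:
$$\max_{\mu\in\Delta}\;\min_{\nu\in\Delta}\;\; \mu^\top A\,\nu \;+\; \tau\,\mathcal{H}(\mu)\;-\;\tau\,\mathcal{H}(\nu),$$
whose solution (a quantal response equilibrium) approaches a Nash equilibrium of the unregularized game as $\tau\to 0$.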
arXiv Detail & Related papers (2021-05-31T17:51:15Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
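GPMD allows general convex regularizers and a generalized Bregman divergence; as a hedged illustration, the special case of entropy regularization with a KL divergence gives the per-state mirror-descent step in closed form:
$$\pi_{t+1}(\cdot\mid s)=\arg\max_{p\in\Delta(\mathcal{A})}\Big\{\eta\,\big\langle Q^{\pi_t}_\tau(s,\cdot),\,p\big\rangle+\eta\tau\,\mathcal{H}(p)-\mathrm{KL}\big(p\,\big\|\,\pi_t(\cdot\mid s)\big)\Big\}
\;\;\Longrightarrow\;\;
\pi_{t+1}(a\mid s)\;\propto\;\pi_t(a\mid s)^{\frac{1}{1+\eta\tau}}\exp\!\Big(\tfrac{\eta\,Q^{\pi_t}_\tau(s,a)}{1+\eta\tau}\Big),$$
where $Q^{\pi_t}_\tau$ is the regularized action-value function.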
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
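A rough sketch of the primal "check-then-update" rule behind CRPO (the step size, tolerance, and single-constraint selection below are illustrative simplifications, not the exact algorithm):

```python
import numpy as np

def crpo_style_step(theta, reward_grad, constraint_vals, constraint_grads,
                    limits, eta=0.01, tol=0.0):
    """If every estimated constraint is within its limit (up to a tolerance),
    take a gradient step to improve the reward; otherwise take a step to
    reduce one violated constraint. Illustrative sketch only."""
    violated = [i for i, (c, b) in enumerate(zip(constraint_vals, limits))
                if c > b + tol]
    if not violated:
        return theta + eta * reward_grad          # improve expected reward
    i = violated[0]
    return theta - eta * constraint_grads[i]      # push a violated constraint down
```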
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean
and Variance [3.062772835338966]
This paper investigates the optimization problem of an infinite stage discrete time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived and it can quantify the difference of the mean-variance combined metrics of MDPs under any two different policies.
A necessary condition of the optimal policy and the optimality of deterministic policies are derived.
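For concreteness, one standard way to combine the two criteria (the paper's exact definition may differ) is a long-run mean-variance metric of the form
$$\eta^{\pi}_{\beta}\;=\;\mu^{\pi}\;-\;\beta\,\sigma^{2,\pi},\qquad
\mu^{\pi}=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} r_t\Big],\qquad
\sigma^{2,\pi}=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1}\big(r_t-\mu^{\pi}\big)^2\Big],$$
with $\beta>0$ trading off variance against mean; the performance difference formula then compares $\eta^{\pi}_{\beta}$ under any two policies.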
arXiv Detail & Related papers (2020-08-09T10:35:35Z) - Convergence Guarantees of Policy Optimization Methods for Markovian Jump
Linear Systems [3.3343656101775365]
We show that the Gauss-Newton method converges to the optimal state feedback controller for MJLS at a linear rate if initialized at a controller which stabilizes the closed-loop dynamics in the mean-square sense.
We present an example to support our theory.
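As a hedged sketch modeled on the Gauss-Newton policy update known from the LQR setting (the mode-coupled preconditioner below is an assumption about the MJLS form, not a quote from the paper): with modes $i=1,\dots,N$, mode-transition probabilities $p_{ij}$, and value matrices $P_j$,
$$K_i \;\leftarrow\; K_i \;-\; \eta\,\Big(R_i + B_i^{\top}\Big(\textstyle\sum_{j} p_{ij} P_j\Big) B_i\Big)^{-1}\nabla_{K_i} C(K),\qquad i=1,\dots,N,$$
i.e., each mode's feedback gain is preconditioned by the corresponding Hessian-like term, which is what yields the linear rate from a mean-square stabilizing initialization.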
arXiv Detail & Related papers (2020-02-10T21:13:42Z)