Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games
- URL: http://arxiv.org/abs/2210.01050v2
- Date: Tue, 4 Oct 2022 01:07:33 GMT
- Title: Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games
- Authors: Shicong Cen, Yuejie Chi, Simon S. Du, Lin Xiao
- Abstract summary: This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
- Score: 63.60117916422867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to
interact in a shared dynamic environment -- permeates across a wide range of
critical applications. While there has been substantial progress on
understanding the global convergence of policy optimization methods in
single-agent RL, the design and analysis of efficient policy optimization
algorithms in the MARL setting present significant challenges which,
unfortunately, remain highly inadequately addressed by existing theory. In this
paper, we focus on the most basic setting of competitive multi-agent RL, namely
two-player zero-sum Markov games, and study equilibrium finding algorithms in
both the infinite-horizon discounted setting and the finite-horizon episodic
setting. We propose a single-loop policy optimization method with symmetric
updates from both agents, where the policy is updated via the
entropy-regularized optimistic multiplicative weights update (OMWU) method and
the value is updated on a slower timescale. We show that, in the
full-information tabular setting, the proposed method achieves a finite-time
last-iterate linear convergence to the quantal response equilibrium of the
regularized problem, which translates to a sublinear last-iterate convergence
to the Nash equilibrium by controlling the amount of regularization. Our
convergence results improve upon the best known iteration complexities, and
lead to a better understanding of policy optimization in competitive Markov
games.
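For intuition, here is a minimal sketch (not the paper's exact algorithm) of an entropy-regularized OMWU step for a single zero-sum matrix game $\max_{\mu}\min_{\nu}\mu^\top A\nu$; in the Markov-game method the payoff matrix is replaced by per-state regularized Q-values maintained on a slower timescale, and the function and variable names below are illustrative.

```python
import numpy as np

def omwu_step(A, mu, nu, mu_bar, nu_bar, eta, tau):
    """One symmetric entropy-regularized OMWU step (illustrative sketch).

    A:              payoff matrix of the game max_mu min_nu mu^T A nu
    mu, nu:         current policies of the max- and min-player
    mu_bar, nu_bar: previous prediction ("optimistic") policies
    eta, tau:       step size and entropy-regularization strength
    """
    def normalize(x):
        return x / x.sum()

    # Prediction step: each player reacts to the opponent's previous prediction.
    mu_bar_new = normalize(mu ** (1 - eta * tau) * np.exp( eta * (A   @ nu_bar)))
    nu_bar_new = normalize(nu ** (1 - eta * tau) * np.exp(-eta * (A.T @ mu_bar)))

    # Update step: each player reacts to the opponent's new prediction.
    mu_new = normalize(mu ** (1 - eta * tau) * np.exp( eta * (A   @ nu_bar_new)))
    nu_new = normalize(nu ** (1 - eta * tau) * np.exp(-eta * (A.T @ mu_bar_new)))
    return mu_new, nu_new, mu_bar_new, nu_bar_new
```

With $\tau > 0$ the iterates approach the quantal response equilibrium of the regularized game at a linear rate; choosing $\tau$ small then controls the distance to the Nash equilibrium of the unregularized game, which is where the sublinear last-iterate rate described in the abstract comes from.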
Related papers
- Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
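Since the local updates follow vanilla PPO, each agent's objective is the standard clipped surrogate (the paper's advantage estimator and regularity conditions are not reproduced here); the sketch below uses a clipping parameter of 0.2, a common default rather than a value from the paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one agent's local policy (sketch).

    ratio:     pi_new(a_i|s) / pi_old(a_i|s) for the agent's own action
    advantage: advantage estimates computed under the old joint policy
    eps:       clipping parameter (illustrative default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Maximize the pessimistic (clipped) objective, as in vanilla PPO.
    return np.minimum(unclipped, clipped).mean()
```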
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic
Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z) - Coordinated Proximal Policy Optimization [28.780862892562308]
Coordinated Proximal Policy Optimization (CoPPO) is an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective.
We then interpret this objective as performing dynamic credit assignment among agents, which alleviates the high-variance issue during concurrent updates of the agent policies.
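As a loose illustration only (CoPPO's actual objective and clipping scheme differ in detail), a joint surrogate can be formed from the product of per-agent probability ratios, so that each agent's update is coordinated with the others':

```python
import numpy as np

def joint_clipped_surrogate(ratios, advantage, eps=0.2):
    """Hypothetical joint surrogate used only to illustrate the idea of a
    coordinated (joint) objective; not CoPPO's exact formulation.

    ratios:    array of shape (n_agents, batch) of per-agent probability ratios
    advantage: array of shape (batch,) of joint advantage estimates
    """
    joint_ratio = np.prod(ratios, axis=0)                  # joint policy ratio
    unclipped = joint_ratio * advantage
    clipped = np.clip(joint_ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```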
arXiv Detail & Related papers (2021-11-07T11:14:19Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
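The AUPRC surrogate loss and its estimator are beyond this summary, but the momentum mechanism itself is a moving average of stochastic gradients; a generic sketch follows (parameter names and defaults are illustrative, not from the paper).

```python
import numpy as np

def momentum_step(w, m, stochastic_grad, eta=0.01, beta=0.9):
    """One moving-average (heavy-ball style) stochastic gradient step (sketch).

    w: parameters, m: momentum buffer, stochastic_grad: minibatch gradient
    of the AUPRC surrogate loss (not implemented here).
    """
    m = beta * m + (1.0 - beta) * stochastic_grad   # moving average of gradients
    w = w - eta * m                                 # descent step
    return w, m
```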
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Fast Policy Extragradient Methods for Competitive Games with Entropy
Regularization [40.21627891283402]
This paper investigates the problem of computing the equilibrium of competitive games.
Motivated by the algorithmic role of entropy regularization, we develop provably efficient extragradient methods.
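Concretely, entropy regularization turns the matrix-game equilibrium problem into a strongly convex-concave saddle-point problem to which extragradient-type updates apply; a sketch of the regularized objective, with $\mathcal{H}$ the Shannon entropy and $\tau>0$ the regularization strength:
$$\max_{\mu\in\Delta}\;\min_{\nu\in\Delta}\;\; \mu^\top A\,\nu \;+\; \tau\,\mathcal{H}(\mu)\;-\;\tau\,\mathcal{H}(\nu),$$
whose solution (a quantal response equilibrium) approaches a Nash equilibrium of the unregularized game as $\tau\to 0$.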
arXiv Detail & Related papers (2021-05-31T17:51:15Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
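GPMD allows general convex regularizers and a generalized Bregman divergence; as a hedged illustration, the special case of entropy regularization with a KL divergence gives the per-state mirror-descent step in closed form:
$$\pi_{t+1}(\cdot\mid s)=\arg\max_{p\in\Delta(\mathcal{A})}\Big\{\eta\,\big\langle Q^{\pi_t}_\tau(s,\cdot),\,p\big\rangle+\eta\tau\,\mathcal{H}(p)-\mathrm{KL}\big(p\,\big\|\,\pi_t(\cdot\mid s)\big)\Big\}
\;\;\Longrightarrow\;\;
\pi_{t+1}(a\mid s)\;\propto\;\pi_t(a\mid s)^{\frac{1}{1+\eta\tau}}\exp\!\Big(\tfrac{\eta\,Q^{\pi_t}_\tau(s,a)}{1+\eta\tau}\Big),$$
where $Q^{\pi_t}_\tau$ is the regularized action-value function.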
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
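A rough sketch of the primal "check-then-update" rule behind CRPO (the step size, tolerance, and single-constraint selection below are illustrative simplifications, not the exact algorithm):

```python
import numpy as np

def crpo_style_step(theta, reward_grad, constraint_vals, constraint_grads,
                    limits, eta=0.01, tol=0.0):
    """If every estimated constraint is within its limit (up to a tolerance),
    take a gradient step to improve the reward; otherwise take a step to
    reduce one violated constraint. Illustrative sketch only."""
    violated = [i for i, (c, b) in enumerate(zip(constraint_vals, limits))
                if c > b + tol]
    if not violated:
        return theta + eta * reward_grad          # improve expected reward
    i = violated[0]
    return theta - eta * constraint_grads[i]      # push a violated constraint down
```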
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean
and Variance [3.062772835338966]
This paper investigates the optimization problem of an infinite stage discrete time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived and it can quantify the difference of the mean-variance combined metrics of MDPs under any two different policies.
A necessary condition of the optimal policy and the optimality of deterministic policies are derived.
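For concreteness, one standard way to combine the two criteria (the paper's exact definition may differ) is a long-run mean-variance metric of the form
$$\eta^{\pi}_{\beta}\;=\;\mu^{\pi}\;-\;\beta\,\sigma^{2,\pi},\qquad
\mu^{\pi}=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} r_t\Big],\qquad
\sigma^{2,\pi}=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1}\big(r_t-\mu^{\pi}\big)^2\Big],$$
with $\beta>0$ trading off variance against mean; the performance difference formula then compares $\eta^{\pi}_{\beta}$ under any two policies.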
arXiv Detail & Related papers (2020-08-09T10:35:35Z) - Convergence Guarantees of Policy Optimization Methods for Markovian Jump
Linear Systems [3.3343656101775365]
We show that the Gauss-Newton method converges to the optimal state feedback controller for MJLS at a linear rate if initialized at a controller which stabilizes the closed-loop dynamics in the mean-square sense.
We present an example to support our theory.
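As a hedged sketch modeled on the Gauss-Newton policy update known from the LQR setting (the mode-coupled preconditioner below is an assumption about the MJLS form, not a quote from the paper): with modes $i=1,\dots,N$, mode-transition probabilities $p_{ij}$, and value matrices $P_j$,
$$K_i \;\leftarrow\; K_i \;-\; \eta\,\Big(R_i + B_i^{\top}\Big(\textstyle\sum_{j} p_{ij} P_j\Big) B_i\Big)^{-1}\nabla_{K_i} C(K),\qquad i=1,\dots,N,$$
i.e., each mode's feedback gain is preconditioned by the corresponding Hessian-like term, which is what yields the linear rate from a mean-square stabilizing initialization.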
arXiv Detail & Related papers (2020-02-10T21:13:42Z)