Policy Mirror Ascent for Efficient and Independent Learning in Mean
Field Games
- URL: http://arxiv.org/abs/2212.14449v2
- Date: Fri, 9 Jun 2023 12:06:32 GMT
- Title: Policy Mirror Ascent for Efficient and Independent Learning in Mean
Field Games
- Authors: Batuhan Yardim, Semih Cayci, Matthieu Geist, Niao He
- Abstract summary: Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games.
We show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples.
- Score: 35.86199604587823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mean-field games have been used as a theoretical tool to obtain an
approximate Nash equilibrium for symmetric and anonymous $N$-player games.
However, limiting applicability, existing theoretical results assume variations
of a "population generative model", which allows arbitrary modifications of the
population distribution by the learning algorithm. Moreover, learning
algorithms typically work on abstract population simulators rather than the
$N$-player game itself. Instead, we show that $N$ agents running policy mirror ascent
converge to the Nash equilibrium of the regularized game within
$\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample
trajectory without a population generative model, up to a standard
$\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Departing from the
literature, instead of working with the best-response map, we first show that a
policy mirror ascent map can be used to
construct a contractive operator having the Nash equilibrium as its fixed
point. We analyze single-path TD learning for $N$-agent games, proving sample
complexity guarantees by only using a sample path from the $N$-agent simulator
without a population generative model. Furthermore, we demonstrate that our
methodology allows for independent learning by $N$ agents with finite sample
guarantees.
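To make the abstract's recipe concrete, below is a minimal, hypothetical Python sketch of independent learning with policy mirror ascent: each of the $N$ agents estimates its regularized Q-function by TD learning along a single shared sample path (no population generative model), then applies an entropy-regularized mirror ascent (exponentiated-gradient) policy update. The simulator interface (`env.reset`, `env.step`), tabular state/action spaces, and all step sizes are assumptions for illustration, not the authors' implementation.
```python
import numpy as np

# Hypothetical sketch of independent policy mirror ascent with single-path
# TD learning for an N-agent symmetric game. The environment interface
# (env.reset / env.step), tabular sizes, and all hyperparameters are
# illustrative assumptions, not the authors' code.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def independent_pma(env, n_agents, n_states, n_actions,
                    outer_iters=50, td_steps=2000,
                    lr_q=0.05, eta=1.0, tau=0.1, gamma=0.99):
    # Each agent keeps only its own softmax policy (logits) and Q estimate:
    # no access to the population distribution or other agents' internals.
    logits = np.zeros((n_agents, n_states, n_actions))
    for _ in range(outer_iters):
        q = np.zeros((n_agents, n_states, n_actions))
        states = env.reset()              # one local state per agent
        # --- single sample path: TD learning, no population generative model ---
        for _ in range(td_steps):
            actions = [np.random.choice(n_actions, p=softmax(logits[i, s]))
                       for i, s in enumerate(states)]
            next_states, rewards = env.step(actions)
            for i in range(n_agents):
                s, a, r, s2 = states[i], actions[i], rewards[i], next_states[i]
                v_next = softmax(logits[i, s2]) @ q[i, s2]  # expected-SARSA target
                q[i, s, a] += lr_q * (r + gamma * v_next - q[i, s, a])
            states = next_states
        # --- entropy-regularized policy mirror ascent (KL / exponentiated gradient) ---
        # pi_{t+1}(a|s) is proportional to pi_t(a|s)^(1 - eta*tau) * exp(eta * Q_t(s, a));
        # the shrinkage by tau reflects the regularization behind the contraction argument.
        logits = (1.0 - eta * tau) * logits + eta * q
    policies = np.exp(logits - logits.max(axis=-1, keepdims=True))
    policies /= policies.sum(axis=-1, keepdims=True)
    return policies
```
With tau = 0 the update reduces to a standard exponentiated-gradient (softmax) policy step; the tau > 0 regularization corresponds to the regularized game whose Nash equilibrium is the fixed point of the contractive operator described in the abstract.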
Related papers
- Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games [31.554420227087043]
We develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players.
In the matrix game setting, the results imply a complexity of $O(\epsilon^{-1})$ to find the Nash distribution.
In the stochastic game setting, the results also imply a complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium.
arXiv Detail & Related papers (2024-09-02T20:07:25Z) - Exploiting Approximate Symmetry for Efficient Multi-Agent Reinforcement Learning [19.543995541149897]
We provide a methodology to extend any finite-player, possibly asymmetric, game to an "induced MFG"
First, we prove that $N$-player dynamic games can be symmetrized and smoothly extended to the infinite-player continuum via explicit Kirszbraun extensions.
For certain games satisfying monotonicity, we prove a sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-6})$ for the $N$-agent game to learn an $\varepsilon$-Nash up to symmetrization bias.
arXiv Detail & Related papers (2024-08-27T16:11:20Z) - MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games [5.778024594615575]
This paper proposes an online mean-field reinforcement learning algorithm for computing Nash equilibria of sequential games.
MF-OML is the first fully polynomial multi-agent reinforcement learning algorithm for provably solving Nash equilibria.
As a byproduct, we also obtain the first tractable, globally convergent computational algorithm for approximately solving monotone mean-field games.
arXiv Detail & Related papers (2024-05-01T02:19:31Z) - Scalable and Independent Learning of Nash Equilibrium Policies in
$n$-Player Stochastic Games with Unknown Independent Chains [1.0878040851638]
We study games with independent chains and unknown transition matrices.
In this class of games, players control their own internal Markov chains whose transitions do not depend on the states/actions of other players.
We propose a fully decentralized mirror descent algorithm to learn an $\epsilon$-NE policy.
arXiv Detail & Related papers (2023-12-04T03:04:09Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy with a minimax-optimal number of samples (up to logarithmic factors).
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Learning Correlated Equilibria in Mean-Field Games [62.14589406821103]
We develop the concepts of Mean-Field correlated and coarse-correlated equilibria.
We show that they can be efficiently learnt in all games, without requiring any additional assumption on the structure of the game.
arXiv Detail & Related papers (2022-08-22T08:31:46Z) - A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.