When Can We Learn General-Sum Markov Games with a Large Number of
Players Sample-Efficiently?
- URL: http://arxiv.org/abs/2110.04184v1
- Date: Fri, 8 Oct 2021 15:06:22 GMT
- Title: When Can We Learn General-Sum Markov Games with a Large Number of
Players Sample-Efficiently?
- Authors: Ziang Song, Song Mei, Yu Bai
- Abstract summary: This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games.
First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes.
Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S \sum_{i\le m} A_i / \epsilon^3)$ episodes.
- Score: 10.397170312149067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent reinforcement learning has made substantial empirical progress
in solving games with a large number of players. However, theoretically, the
best known sample complexity for finding a Nash equilibrium in general-sum
games scales exponentially in the number of players due to the size of the
joint action space, and there is a matching exponential lower bound. This paper
investigates what learning goals admit better sample complexities in the
setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and
$A_i$ actions per player. First, we design algorithms for learning an
$\epsilon$-Coarse Correlated Equilibrium (CCE) in
$\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / \epsilon^2)$ episodes, and an
$\epsilon$-Correlated Equilibrium (CE) in
$\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This
is the first line of results for learning CCE and CE with sample complexities
polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an
adversarial bandit subroutine which minimizes a weighted swap regret, along
with several novel designs in the outer loop. Second, we consider the important
special case of Markov Potential Games, and design an algorithm that learns an
$\epsilon$-approximate Nash equilibrium within
$\widetilde{\mathcal{O}}(S\sum_{i\le m} A_i / \epsilon^3)$ episodes (when only
highlighting the dependence on $S$, $A_i$, and $\epsilon$), which depends only
linearly on $\sum_{i\le m} A_i$ and significantly improves over the best known
algorithm in the $\epsilon$ dependence. Overall, our results shed light on what
equilibria or structural assumptions on the game may enable sample-efficient
learning with many players.
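For reference, these learning goals instantiate the following standard notions (the paper states Markov-game versions over policies and value functions; this schematic uses $V_i^{\pi}$ for player $i$'s value under a joint, possibly correlated, policy $\pi$, and writes $\phi_i \diamond \pi$ for the policy in which player $i$ applies a strategy modification $\phi_i$ to $\pi$'s recommended actions). The policy $\pi$ is an $\epsilon$-CCE if $\max_{i\le m} \bigl( \max_{\pi_i'} V_i^{\pi_i' \times \pi_{-i}} - V_i^{\pi} \bigr) \le \epsilon$, i.e., no player gains more than $\epsilon$ by abandoning the correlation device and playing an independent policy $\pi_i'$; it is an $\epsilon$-CE if $\max_{i\le m} \max_{\phi_i} \bigl( V_i^{\phi_i \diamond \pi} - V_i^{\pi} \bigr) \le \epsilon$. Every independent deviation is a constant strategy modification, so an $\epsilon$-CE is also an $\epsilon$-CCE; this ordering is why learning a CE requires controlling the stronger swap regret rather than ordinary external regret. The payoff from targeting these weaker equilibria is concrete: with $m = 10$ players and $A_i = 10$ actions each, the joint action space has $\prod_{i\le m} A_i = 10^{10}$ elements (the source of the exponential Nash lower bound), whereas $\max_{i\le m} A_i = 10$ and $\sum_{i\le m} A_i = 100$.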
Related papers
- $\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games [8.215874655947335]
We show that an optimistic follow-the-regularized-leader algorithm can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations (for the classical normal-form connection between no-regret learning and coarse correlated equilibria, see the sketch at the end of this list).
arXiv Detail & Related papers (2024-02-02T20:40:27Z)
- Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z)
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy from a generative model, using a number of samples that is minimax-optimal up to logarithmic factors.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z)
- Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games [18.832436856339587]
This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data.
We propose VI-LCB-Game, a pessimistic model-based algorithm with Bernstein-style lower confidence bounds.
arXiv Detail & Related papers (2022-06-08T17:58:06Z)
- Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games [22.94768429216664]
Imperfect-Information Extensive-Form Games (IIEFGs) are a prevalent model for real-world games involving imperfect information and sequential play.
The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs.
This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback.
arXiv Detail & Related papers (2022-05-15T08:55:28Z)
- Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z)
- Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z)
- Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z)
- Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
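As referenced in the first related-paper entry above, here is a minimal sketch of the classical normal-form connection between no-regret learning and coarse correlated equilibria, which underlies CCE-learning results of this kind. It is an illustration under simplifying assumptions, not the algorithm of the main paper or of any paper listed: a one-shot two-player general-sum game with full-information feedback, Hedge (multiplicative-weights) updates, and arbitrary placeholder payoff matrices. Minimizing swap regret in place of external regret would yield a correlated equilibrium instead.

import numpy as np

# Hedge self-play in a random 3x3 general-sum game. Classical fact: the
# time-averaged joint play of no-external-regret learners is an approximate
# coarse correlated equilibrium, with gap shrinking like O(1/sqrt(T)).
rng = np.random.default_rng(0)
A, B = 3, 3                        # action counts for the two players
U1 = rng.uniform(size=(A, B))      # player 1's payoffs (illustrative placeholder)
U2 = rng.uniform(size=(A, B))      # player 2's payoffs (illustrative placeholder)

T = 20000
eta1 = np.sqrt(np.log(A) / T)      # standard Hedge step sizes
eta2 = np.sqrt(np.log(B) / T)
w1, w2 = np.zeros(A), np.zeros(B)  # cumulative (scaled) payoff totals
avg_joint = np.zeros((A, B))       # empirical distribution of joint play

for _ in range(T):
    p = np.exp(w1 - w1.max()); p /= p.sum()  # player 1's current mixture
    q = np.exp(w2 - w2.max()); q /= q.sum()  # player 2's current mixture
    avg_joint += np.outer(p, q) / T
    # Full-information feedback: each player observes the expected payoff of
    # each of its own actions against the opponent's current mixture.
    w1 += eta1 * (U1 @ q)
    w2 += eta2 * (U2.T @ p)

# CCE gap per player: best fixed deviation against the joint distribution's
# marginal over the opponent, minus the on-path expected payoff.
marg2 = avg_joint.sum(axis=0)      # marginal over player 2's actions
marg1 = avg_joint.sum(axis=1)      # marginal over player 1's actions
gap1 = (U1 @ marg2).max() - (avg_joint * U1).sum()
gap2 = (U2.T @ marg1).max() - (avg_joint * U2).sum()
print(f"CCE gaps: player 1 = {gap1:.4f}, player 2 = {gap2:.4f}")

Running longer (larger $T$) drives both gaps toward zero; replacing each Hedge learner with a swap-regret minimizer (e.g., the Blum-Mansour reduction) would make the same empirical distribution an approximate CE.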