Sample-Efficient Learning of Correlated Equilibria in Extensive-Form
Games
- URL: http://arxiv.org/abs/2205.07223v1
- Date: Sun, 15 May 2022 08:55:28 GMT
- Title: Sample-Efficient Learning of Correlated Equilibria in Extensive-Form
Games
- Authors: Ziang Song, Song Mei, Yu Bai
- Abstract summary: Imperfect-Information Extensive-Form Games (IIEFGs) are a prevalent model for real-world games involving imperfect information and sequential play.
The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs.
This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback.
- Score: 22.94768429216664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imperfect-Information Extensive-Form Games (IIEFGs) are a prevalent model for
real-world games involving imperfect information and sequential play. The
Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural
solution concept for multi-player general-sum IIEFGs. However, existing
algorithms for finding an EFCE require full feedback from the game, and it
remains open how to efficiently learn the EFCE in the more challenging bandit
feedback setting, where the game can only be learned through observations from
repeated play.
This paper presents the first sample-efficient algorithm for learning the
EFCE from bandit feedback. We begin by proposing $K$-EFCE -- a generalized
definition that allows players to observe and deviate from the recommended
actions up to $K$ times. The $K$-EFCE includes the EFCE as a special case at
$K=1$, and is an increasingly stricter notion of equilibrium as $K$ increases.
We then design an uncoupled no-regret algorithm that finds an
$\varepsilon$-approximate $K$-EFCE within
$\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K}/\varepsilon^2)$ iterations in the
full feedback setting, where $X_i$ and $A_i$ are the number of information sets
and actions for the $i$-th player. Our algorithm works by minimizing a
wide-range regret at each information set that takes into account all possible
recommendation histories. Finally, we design a sample-based variant of our
algorithm that learns an $\varepsilon$-approximate $K$-EFCE within
$\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K+1}/\varepsilon^2)$ episodes of play
in the bandit feedback setting. When specialized to $K=1$, this gives the first
sample-efficient algorithm for learning EFCE from bandit feedback.
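The general recipe the abstract alludes to -- uncoupled no-regret dynamics whose time-averaged joint play converges to an approximate correlated equilibrium -- can be illustrated with a much simpler building block. The sketch below is not the paper's $K$-EFCE algorithm: it runs plain regret matching for two players in a small normal-form game with full feedback, so that the empirical distribution of joint play approaches a coarse correlated equilibrium. The game, the regret-matching rule, and all names are illustrative assumptions; handling information sets, wide-range ($K$-fold) deviations, and bandit feedback is exactly what the paper contributes beyond this toy picture.

```python
import numpy as np

rng = np.random.default_rng(0)

def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
    """Map cumulative regrets to a mixed strategy (uniform if no positive regret)."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.full(cum_regret.shape, 1.0 / cum_regret.size)

# Illustrative 2x2 general-sum game: payoff[i][a0, a1] is player i's payoff.
payoff = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[0.0, 1.0], [1.0, 0.0]])]
n_actions = [2, 2]
cum_regret = [np.zeros(n) for n in n_actions]
joint_counts = np.zeros((2, 2))  # empirical distribution of joint play

T = 5000
for _ in range(T):
    strategies = [regret_matching(r) for r in cum_regret]
    actions = [rng.choice(n, p=s) for n, s in zip(n_actions, strategies)]
    joint_counts[actions[0], actions[1]] += 1.0
    # Full-feedback regret update for brevity; a bandit-feedback variant would
    # replace the true payoff vector with an importance-weighted estimate built
    # from the single observed payoff.
    u0 = payoff[0][:, actions[1]]  # player 0's payoffs against opponent's action
    u1 = payoff[1][actions[0], :]  # player 1's payoffs against opponent's action
    cum_regret[0] += u0 - u0[actions[0]]
    cum_regret[1] += u1 - u1[actions[1]]

print("Empirical joint play distribution:")
print(joint_counts / T)
```

By the standard no-regret argument, when every player's average regret falls below $\varepsilon$, the empirical joint distribution printed above is an $\varepsilon$-approximate coarse correlated equilibrium; the paper's contribution is obtaining the analogous guarantee for the much richer $K$-EFCE deviation class in extensive-form games, from bandit feedback alone.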
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Efficient $\Phi$-Regret Minimization in Extensive-Form Games via Online
Mirror Descent [49.93548413166884]
$\Phi$-Hedge is a generic algorithm capable of learning a large class of equilibria for Normal-Form Games (NFGs).
We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs.
We prove that, in those settings, the $\Phi$-Hedge algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for
arXiv Detail & Related papers (2022-05-30T17:58:06Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback [29.553652241608997]
We study the class of smooth and strongly monotone games and optimal no-regret learning therein.
We first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$.
Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning.
arXiv Detail & Related papers (2021-12-06T08:27:54Z) - When Can We Learn General-Sum Markov Games with a Large Number of
Players Sample-Efficiently? [10.397170312149067]
This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games.
First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes.
Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(\ldots)$
arXiv Detail & Related papers (2021-10-08T15:06:22Z) - Almost Optimal Algorithms for Two-player Markov Games with Linear
Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-the-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z)