Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov
Games with Perfect Recall
- URL: http://arxiv.org/abs/2106.06279v1
- Date: Fri, 11 Jun 2021 09:51:29 GMT
- Title: Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov
Games with Perfect Recall
- Authors: Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko
- Abstract summary: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play.
We provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm.
IXOMD is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$, where $T$ is the number of played games.
- Score: 34.73929457960903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of learning a Nash equilibrium (NE) in an imperfect
information game (IIG) through self-play. Precisely, we focus on two-player,
zero-sum, episodic, tabular IIG under the perfect-recall assumption where the
only feedback is realizations of the game (bandit feedback). In particular, the
dynamics of the IIG are not known -- we can only access them by sampling or
interacting with a game simulator. For this learning setting, we provide the
Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a
model-free algorithm with a high-probability bound on the convergence rate to
the NE of order $1/\sqrt{T}$ where $T$ is the number of played games. Moreover,
IXOMD is computationally efficient as it needs to perform the updates only
along the sampled trajectory.
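The abstract names the two ingredients of IXOMD, an implicit-exploration (IX) loss estimate and an online-mirror-descent step applied only at the information sets visited in the sampled episode, but does not spell out the update rule. The snippet below is a minimal sketch of those two ingredients under simplifying assumptions; the function name, the trajectory format, and the per-infoset normalization are illustrative choices, not the exact IXOMD update from the paper.

```python
import numpy as np

def ixomd_episode_update(policy, trajectory, eta=0.1, gamma=0.01):
    """One IXOMD-style update along a single sampled trajectory (illustrative sketch).

    policy: dict mapping an information-set key to a probability vector over the
            actions available there.
    trajectory: list of (infoset, action_index, loss) tuples observed while playing
            one episode, with each loss in [0, 1].
    eta: learning rate of the entropy-regularized online mirror descent step.
    gamma: implicit-exploration (IX) bias added to the sampling probability, which
            keeps the importance-weighted loss estimate bounded.
    """
    reach = 1.0  # product of the learner's own action probabilities (perfect recall)
    for infoset, action, loss in trajectory:
        p = policy[infoset]
        reach *= p[action]  # probability of having sampled this action sequence
        # IX loss estimate: importance weighting with a gamma-smoothed denominator.
        loss_hat = np.zeros_like(p)
        loss_hat[action] = loss / (reach + gamma)
        # OMD with an entropy regularizer reduces to a multiplicative-weights step,
        # performed only at the information sets visited in this episode.
        new_p = p * np.exp(-eta * loss_hat)
        policy[infoset] = new_p / new_p.sum()
    return policy

# Hypothetical usage: two information sets with two actions each, one sampled episode.
policy = {"I1": np.array([0.5, 0.5]), "I2": np.array([0.5, 0.5])}
episode = [("I1", 0, 0.0), ("I2", 1, 1.0)]  # (infoset, sampled action, realized loss)
policy = ixomd_episode_update(policy, episode)
```

Because only the visited information sets are touched, the per-episode cost scales with the trajectory length rather than with the size of the game tree, which is the computational-efficiency point made in the abstract.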
Related papers
- Learning to Control Unknown Strongly Monotone Games [16.327788209492805]
We propose a simple algorithm that learns to shift the NE of the game to meet the linear constraints by adjusting the controlled coefficients online.
We prove that our algorithm, which is based on a two-timescale approximation, guarantees convergence with probability 1 to the set of NE that meet the target linear constraints.
We demonstrate how our scheme can be applied to optimizing a global quadratic cost at NE and load balancing in resource allocation games.
arXiv Detail & Related papers (2024-06-30T03:33:42Z) - Decentralized model-free reinforcement learning in stochastic games with
average-reward objective [1.9852463786440127]
We show that our algorithm achieves both sublinear high-probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$.
Our algorithm enjoys low computational complexity and a low memory requirement compared to previous works.
arXiv Detail & Related papers (2023-01-13T15:59:53Z) - Policy Mirror Ascent for Efficient and Independent Learning in Mean
Field Games [35.86199604587823]
Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games.
We show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples.
arXiv Detail & Related papers (2022-12-29T20:25:18Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using a near-minimax-optimal number of samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - Almost Optimal Algorithms for Two-player Markov Games with Linear
Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-the-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z) - Combining Deep Reinforcement Learning and Search for
Imperfect-Information Games [30.520629802135574]
We present ReBeL, a framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in two-player zero-sum games.
We also show that ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI.
arXiv Detail & Related papers (2020-07-27T15:21:22Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Learning Zero-Sum Simultaneous-Move Markov Games Using Function
Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap (a single-matrix-game sketch of this quantity appears after this list).
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
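The duality gap used as the offline objective in the last entry above has a simple concrete form in a single zero-sum matrix game: the gap of a strategy pair is the sum of how much each player could gain by best responding, and it is zero exactly at a Nash equilibrium. Below is a minimal sketch of that computation; the payoff matrix, function name, and matching-pennies example are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def duality_gap(A, x, y):
    """Duality gap (exploitability) of a strategy pair in a zero-sum matrix game.

    A: payoff matrix for the row player (the row player maximizes x^T A y).
    x, y: mixed strategies (probability vectors) for the row and column players.
    The gap is non-negative and equals zero exactly at a Nash equilibrium.
    """
    best_row_response = np.max(A @ y)   # best value the row player can get against y
    best_col_response = np.min(x @ A)   # best (lowest) value the column player can force against x
    return best_row_response - best_col_response

# Example: matching pennies; the uniform strategies form the Nash equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(duality_gap(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(duality_gap(A, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # 1.0
```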