Near-Optimal Reinforcement Learning with Self-Play
- URL: http://arxiv.org/abs/2006.12007v2
- Date: Tue, 14 Jul 2020 04:21:16 GMT
- Title: Near-Optimal Reinforcement Learning with Self-Play
- Authors: Yu Bai, Chi Jin, Tiancheng Yu
- Abstract summary: We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
- Score: 50.29853537456737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the problem of designing optimal algorithms for
reinforcement learning in two-player zero-sum games. We focus on self-play
algorithms which learn the optimal policy by playing against themselves without any
direct supervision. In a tabular episodic Markov game with $S$ states, $A$
max-player actions and $B$ min-player actions, the best existing algorithm for
finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$
steps of game playing, when only highlighting the dependency on $(S,A,B)$. In
contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a
significant gap from the upper bound. This paper closes this gap for the first
time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm
with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash
V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The
latter result matches the information-theoretic lower bound in all
problem-dependent parameters except for a polynomial factor of the length of
each episode. In addition, we present a computational hardness result for
learning the best responses against a fixed opponent in Markov games---a
learning objective different from finding the Nash equilibrium.
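
The abstract states only the sample complexities; for intuition, the sketch below illustrates the kind of per-state update behind Nash V-learning for the max-player: an incremental, optimistic value update paired with an adversarial-bandit (exponential-weights) policy update. The learning rates, the bonus constant, and the helper name `nash_v_update` are illustrative assumptions rather than the paper's exact algorithm, which uses a Follow-the-Regularized-Leader subroutine with carefully tuned parameters.

```python
# Minimal sketch of one max-player update in the spirit of Nash V-learning,
# assuming a tabular episodic Markov game with horizon H.  Constants and the
# Hedge step below are simplified stand-ins for the paper's FTRL subroutine.
import numpy as np

def nash_v_update(V_h, V_next, policy, s, a, r, s_next, t, H, c_bonus=1.0):
    """One incremental V-update and one bandit-style policy update at state s.

    V_h, V_next : dict state -> optimistic value estimates at steps h and h+1
    policy      : dict state -> probability vector over the max-player's actions
    t           : number of visits to state s at step h so far (t >= 1)
    """
    n_actions = len(policy[s])
    alpha = (H + 1) / (H + t)                    # stage-based learning rate
    beta = c_bonus * np.sqrt(H**3 * np.log(n_actions) / t)  # optimism bonus (illustrative)

    # Optimistic incremental value update, clipped at the maximum return H
    V_h[s] = (1 - alpha) * V_h[s] + alpha * min(H, r + V_next.get(s_next, 0.0) + beta)

    # Importance-weighted bandit loss for the action actually played
    loss = np.zeros(n_actions)
    loss[a] = (H - r - V_next.get(s_next, 0.0)) / (H * max(policy[s][a], 1e-8))

    # Exponential-weights policy update (stand-in for the paper's FTRL step)
    eta = np.sqrt(np.log(n_actions) / (n_actions * t))
    w = policy[s] * np.exp(-eta * loss)
    policy[s] = w / w.sum()
    return V_h, policy
```

The key design point this sketch tries to convey is that each player only observes its own rewards and actions, so the per-state policy update is a single-agent adversarial-bandit step, which is why the resulting sample complexity scales with $A+B$ rather than $AB$.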
Related papers
- Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games [18.832436856339587]
This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data.
We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game.
arXiv Detail & Related papers (2022-06-08T17:58:06Z)
- Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X,Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback [29.553652241608997]
We study the class of smooth and strongly monotone games and optimal no-regret learning therein.
We first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$.
Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning.
arXiv Detail & Related papers (2021-12-06T08:27:54Z) - When Can We Learn General-Sum Markov Games with a Large Number of
Players Sample-Efficiently? [10.397170312149067]
This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games.
First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes.
Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(\ldots)$ episodes.
arXiv Detail & Related papers (2021-10-08T15:06:22Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - Almost Optimal Algorithms for Two-player Markov Games with Linear
Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z)
- A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm, Optimistic Nash Value Iteration (Nash-VI), for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z)
- Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.