Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
- URL: http://arxiv.org/abs/2206.04044v2
- Date: Mon, 26 Feb 2024 15:18:31 GMT
- Title: Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
- Authors: Yuling Yan and Gen Li and Yuxin Chen and Jianqing Fan
- Abstract summary: This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data.
We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game.
- Score: 18.832436856339587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper makes progress towards learning Nash equilibria in two-player
zero-sum Markov games from offline data. Specifically, consider a
$\gamma$-discounted infinite-horizon Markov game with $S$ states, where the
max-player has $A$ actions and the min-player has $B$ actions. We propose a
pessimistic model-based algorithm with Bernstein-style lower confidence bounds
-- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash
equilibrium with a sample complexity no larger than
$\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up
to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral
clipped concentrability coefficient that reflects the coverage and distribution
shift of the available data (vis-\`a-vis the target data), and the target
accuracy $\varepsilon$ can be any value within
$\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior
art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire
$\varepsilon$-range. An appealing feature of our result lies in its algorithmic
simplicity, which reveals that neither variance reduction nor sample splitting is
necessary to achieve sample optimality.
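To make the pessimism principle above concrete, here is a minimal, illustrative sketch of model-based value iteration with a Bernstein-style lower-confidence penalty for a tabular zero-sum Markov game, assuming NumPy/SciPy. It is not the paper's VI-LCB-Game routine: the dataset format, penalty constants, iteration count, and the `matrix_game_value` helper are assumptions made for this example, and no optimality claim is attached to it.

```python
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(M):
    """Max-min value of a zero-sum matrix game with payoff matrix M (to the max-player)."""
    num_a, num_b = M.shape
    # Decision variables: [x_1, ..., x_A, v]; maximize v <=> minimize -v.
    c = np.zeros(num_a + 1)
    c[-1] = -1.0
    # For every min-player action j:  v - x^T M[:, j] <= 0.
    A_ub = np.hstack([-M.T, np.ones((num_b, 1))])
    b_ub = np.zeros(num_b)
    # x must be a probability distribution over the max-player's actions.
    A_eq = np.zeros((1, num_a + 1))
    A_eq[0, :num_a] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * num_a + [(None, None)])
    return -res.fun


def pessimistic_vi(dataset, S, A, B, gamma, delta=0.05, iters=200):
    """Toy pessimistic value iteration from offline zero-sum game data.

    dataset: iterable of (s, a, b, r, s_next) tuples with rewards in [0, 1].
    Returns a pessimistic value estimate for the max-player at every state.
    """
    counts = np.zeros((S, A, B, S))
    reward_sum = np.zeros((S, A, B))
    for s, a, b, r, s_next in dataset:
        counts[s, a, b, s_next] += 1.0
        reward_sum[s, a, b] += r
    n = np.maximum(counts.sum(axis=-1), 1.0)      # visit counts N(s, a, b), floored at 1
    P_hat = counts / n[..., None]                 # empirical transition kernel
    r_hat = reward_sum / n                        # empirical mean rewards

    V = np.zeros(S)
    log_term = np.log(S * A * B * iters / delta)
    for _ in range(iters):
        EV = np.einsum('sabt,t->sab', P_hat, V)   # empirical expectation of V
        var = np.maximum(np.einsum('sabt,t->sab', P_hat, V ** 2) - EV ** 2, 0.0)
        # Bernstein-style penalty: large where coverage is poor or the
        # next-state value is highly variable under the empirical model.
        bonus = np.sqrt(2.0 * var * log_term / n) + log_term / ((1.0 - gamma) * n)
        Q = np.clip(r_hat + gamma * EV - bonus, 0.0, 1.0 / (1.0 - gamma))
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V
```

The design choice mirrored here is that the penalty is subtracted from the max-player's Q-values, so poorly covered state-action pairs look unattractive and the computed equilibrium values stay conservative under distribution shift.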
Related papers
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}\big(\frac{S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}\big)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
Offline reinforcement learning (RL) learns from pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X,Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - When Can We Learn General-Sum Markov Games with a Large Number of
Players Sample-Efficiently? [10.397170312149067]
This paper investigates what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games.
First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes.
Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(\cdots)$ episodes.
arXiv Detail & Related papers (2021-10-08T15:06:22Z) - A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) -- for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an improved algorithm with sharper sample complexity (a generic model-free sketch follows this list).
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
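For contrast with the model-based routines above, the following is a minimal, generic model-free sketch: synchronous Q-learning for a tabular $\gamma$-discounted MDP with generative-model access, assuming NumPy. It only illustrates the model-free paradigm referenced in the last entry; the `sample_step` interface, step-size schedule, and iteration count are assumptions, and it does not reproduce that paper's algorithm or its guarantees.

```python
import numpy as np


def synchronous_q_learning(sample_step, S, A, gamma, num_iters=10000):
    """Toy synchronous Q-learning with generative-model access.

    sample_step(s, a) -> (reward, next_state) draws one transition;
    rewards are assumed to lie in [0, 1].
    Returns the greedy policy and the final Q-estimate.
    """
    Q = np.zeros((S, A))
    for t in range(1, num_iters + 1):
        # Rescaled linear step size, a common choice for discounted Q-learning.
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        for s in range(S):
            for a in range(A):
                r, s_next = sample_step(s, a)
                target = r + gamma * Q[s_next].max()
                Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
    return Q.argmax(axis=1), Q
```

Any simulator exposing a `sample_step(s, a)` call over the $S \times A$ grid can be plugged in; the update contracts toward the Bellman optimality operator without ever forming an explicit transition model.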