A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
- URL: http://arxiv.org/abs/2010.01604v2
- Date: Sun, 7 Feb 2021 03:38:01 GMT
- Title: A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
- Authors: Qinghua Liu, Tiancheng Yu, Yu Bai, Chi Jin
- Abstract summary: We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
- Score: 49.88233710867315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based algorithms -- algorithms that explore the environment through
building and utilizing an estimated model -- are widely used in reinforcement
learning practice and theoretically shown to achieve optimal sample efficiency
for single-agent reinforcement learning in Markov Decision Processes (MDPs).
However, for multi-agent reinforcement learning in Markov games, the current
best known sample complexity for model-based algorithms is rather suboptimal
and compares unfavorably against recent model-free approaches. In this paper,
we present a sharp analysis of model-based self-play algorithms for multi-agent
Markov games. We design an algorithm -- Optimistic Nash Value Iteration
(Nash-VI) for two-player zero-sum Markov games that is able to output an
$\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$
episodes of game playing, where $S$ is the number of states, $A,B$ are the
number of actions for the two players respectively, and $H$ is the horizon
length. This significantly improves over the best known model-based guarantee
of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches
the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for
a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against
the best known model-free algorithm if $\min \{A,B\}=o(H^3)$, and outputs a
single Markov policy while existing sample-efficient model-free algorithms
output a nested mixture of Markov policies that is in general non-Markov and
rather inconvenient to store and execute. We further adapt our analysis to
designing a provably efficient task-agnostic algorithm for zero-sum Markov
games, and designing the first line of provably sample-efficient algorithms for
multi-player general-sum Markov games.
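The $\min\{A,B\}$ gap to the lower bound quoted above follows from a one-line comparison of the two rates; the display below simply works out the claim already stated in the abstract.

```latex
\frac{H^3 S A B / \epsilon^2}{H^3 S (A+B) / \epsilon^2}
  = \frac{AB}{A+B}
  = \frac{\min\{A,B\}\,\max\{A,B\}}{A+B}
  \;\le\; \min\{A,B\}.
```

For intuition on the algorithm itself, the following is a minimal sketch of one backward pass of optimistic value iteration over an estimated tabular model of a two-player zero-sum Markov game, with each stage-wise matrix game solved exactly by linear programming. It is not the paper's Nash-VI: the exploration bonus here is a generic Hoeffding-style term rather than the paper's sharper bonus, the paper additionally maintains a pessimistic (lower) value estimate and certifies its output policy from that pair, and all names (`matrix_game_value`, `optimistic_value_iteration`, `beta_scale`) are illustrative.

```python
# Minimal sketch (NOT the paper's exact Nash-VI): one backward pass of
# optimistic value iteration over an estimated tabular two-player zero-sum
# Markov game.  The bonus is a generic Hoeffding-style term; the paper uses a
# sharper bonus and also maintains a lower (pessimistic) value estimate.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q):
    """Value and max-player strategy of the zero-sum matrix game Q (A x B),
    i.e. max_x min_y x^T Q y, computed by linear programming."""
    A, B = Q.shape
    # Decision variables: (x_1, ..., x_A, v); maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every column b:  v - sum_a x_a * Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # x lies on the probability simplex: sum_a x_a = 1, x_a >= 0.
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]


def optimistic_value_iteration(P_hat, r_hat, N, H, beta_scale=1.0):
    """One backward pass of optimistic (upper-bound) value iteration.

    P_hat[h]: (S, A, B, S) empirical transitions, r_hat[h]: (S, A, B) mean
    rewards, N[h]: (S, A, B) visit counts.  Returns values V with V[h, s]
    optimistically estimating the Nash value under the exploration bonus.
    """
    S, A, B, _ = P_hat[0].shape
    V = np.zeros((H + 1, S))  # V[H] = 0 by convention
    for h in reversed(range(H)):
        for s in range(S):
            # Hoeffding-style bonus ~ H / sqrt(visit count) (simplified).
            bonus = beta_scale * H * np.sqrt(1.0 / np.maximum(N[h][s], 1))
            Q = np.clip(r_hat[h][s] + P_hat[h][s] @ V[h + 1] + bonus, 0.0, H)
            V[h, s], _ = matrix_game_value(Q)  # Nash value of the stage game
    return V
```

In the paper's algorithm, the per-stage equilibrium strategies computed during such a backward pass assemble into the single Markov policy that the abstract refers to.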
Related papers
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design: the reference value functions are updated as a pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}\big(\frac{S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}\big)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.