Representation Learning for General-sum Low-rank Markov Games
- URL: http://arxiv.org/abs/2210.16976v1
- Date: Sun, 30 Oct 2022 22:58:22 GMT
- Title: Representation Learning for General-sum Low-rank Markov Games
- Authors: Chengzhuo Ni, Yuda Song, Xuezhou Zhang, Chi Jin, Mengdi Wang
- Abstract summary: We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
- Score: 63.119870889883224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study multi-agent general-sum Markov games with nonlinear function
approximation. We focus on low-rank Markov games whose transition matrix admits
a hidden low-rank structure on top of an unknown non-linear representation. The
goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium
policy sample efficiently without prior knowledge of the environment or the
representation, and (2) permits a deep-learning friendly implementation. We
leverage representation learning and present a model-based and a model-free
approach to construct an effective representation from the collected data. For
both approaches, the algorithm achieves a sample complexity of
poly$(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the
dimension of the feature vector, $A$ is the size of the joint action space and
$\varepsilon$ is the optimality gap. When the number of players is large, the
above sample complexity can scale exponentially with the number of players in
the worst case. To address this challenge, we consider Markov games with a
factorized transition structure and present an algorithm that escapes such
exponential scaling. To the best of our knowledge, this is the first sample-efficient
algorithm for multi-agent general-sum Markov games that incorporates
(non-linear) function approximation. We accompany our theoretical result with a
neural network-based implementation of our algorithm and evaluate it against
the widely used deep RL baseline, DQN with fictitious play.
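For concreteness, the hidden low-rank structure described above can be written out as follows. This is a minimal sketch in the standard low-rank MDP notation; the symbols $\phi^*_h$ and $\mu^*_h$ are assumed here for illustration and are not quoted from the paper:
$$
\mathbb{P}_h(s' \mid s, \mathbf{a}) \;=\; \big\langle \phi^*_h(s, \mathbf{a}),\, \mu^*_h(s') \big\rangle,
\qquad \phi^*_h(s, \mathbf{a}) \in \mathbb{R}^d,
$$
where $\mathbf{a} = (a^1, \dots, a^m)$ is the joint action of the $m$ players and both the representation $\phi^*_h$ and the measure $\mu^*_h$ are unknown nonlinear maps; the model-based and model-free approaches each learn an approximate representation $\hat{\phi}$ from the collected data and plan with it. The factorized variant mentioned above additionally assumes that the transition decomposes across agents (in a form not spelled out in this abstract), which is what lets the corresponding algorithm escape the worst-case exponential scaling in the number of players.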
Related papers
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free, stage-based Q-learning algorithm can achieve the same optimal dependence on the horizon $H$ as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy with minimax-optimal sample complexity.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z)
- Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z)
- A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design Optimistic Nash Value Iteration (Nash-VI), an algorithm for two-player zero-sum Markov games; a generic sketch of the Nash value-iteration update appears after this list.
We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
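As referenced in the Nash-VI entry above, here is a generic sketch of the exact Nash value-iteration update for a two-player zero-sum Markov game with horizon $H$; the optimism bonuses and empirical model estimates that Nash-VI actually uses are omitted, and the notation ($r_h$, $\mathbb{P}_h$, $\Delta(\mathcal{A})$, $\Delta(\mathcal{B})$) is assumed for illustration:
$$
Q_h(s, a, b) = r_h(s, a, b) + \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a, b)}\big[ V_{h+1}(s') \big],
\qquad
V_h(s) = \max_{\pi \in \Delta(\mathcal{A})} \min_{\nu \in \Delta(\mathcal{B})} \mathbb{E}_{a \sim \pi,\, b \sim \nu}\big[ Q_h(s, a, b) \big],
$$
computed backwards from $V_{H+1} \equiv 0$; the max-min value at each state is a matrix-game Nash value, and an optimistic variant replaces $\mathbb{P}_h$ with an empirical estimate plus a bonus term.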