Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity
- URL: http://arxiv.org/abs/2007.07461v3
- Date: Tue, 8 Aug 2023 22:36:08 GMT
- Title: Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity
- Authors: Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang
- Abstract summary: We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
- Score: 67.02490430380415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using
an empirical model, has long been recognized as one of the cornerstones of RL.
It is especially suitable for multi-agent RL (MARL), as it naturally decouples
the learning and the planning phases, and avoids the non-stationarity problem
when all agents are improving their policies simultaneously using samples.
Though intuitive and widely-used, the sample complexity of model-based MARL
algorithms has not been fully investigated. In this paper, our goal is to
address the fundamental question about its sample complexity. We study arguably
the most basic MARL setting: two-player discounted zero-sum Markov games, given
only access to a generative model. We show that model-based MARL achieves a
sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for
finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the
$\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the
discount factor and $S$, $A$, $B$ denote the state space and the action spaces of
the two agents, respectively. We further show that such a sample bound is minimax-optimal (up
to logarithmic factors) if the algorithm is reward-agnostic, where the
algorithm queries state transition samples without reward knowledge, by
establishing a matching lower bound. This is in contrast to the usual
reward-aware setting, with a
$\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where
this model-based approach is near-optimal with only a gap on the $|A|,|B|$
dependence. Our results not only demonstrate the sample-efficiency of this
basic model-based approach in MARL, but also elaborate on the fundamental
tradeoff between its power (easily handling the more challenging
reward-agnostic case) and its limitation (less adaptive and suboptimal in the
$|A|,|B|$ dependence), a tradeoff that arises particularly in the multi-agent context.
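To make the recipe in the abstract concrete, below is a minimal sketch of the model-based (plug-in) pipeline with a generative model: query the simulator $N$ times for every triple $(s,a,b)$, build the empirical transition model, and then plan in that model. The sketch is an illustration under simplifying assumptions, not the paper's algorithm: rewards are taken as known, planning is exact Shapley (Nash) value iteration with an LP matrix-game solver rather than the smooth planning oracle, and `sample_next_state` is a hypothetical handle to the generative model.

```python
# Minimal sketch of the model-based (plug-in) approach for a two-player
# zero-sum Markov game with a generative model.  Assumptions (not from the
# paper): rewards are known, planning is exact Shapley/Nash value iteration
# with an LP matrix-game solver instead of the smooth planning oracle, and
# sample_next_state(s, a, b) is a hypothetical handle returning one next-state
# index drawn from the true transition kernel.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q_sab):
    """Value of the zero-sum matrix game max_x min_y x^T Q y, solved as an LP."""
    A, B = Q_sab.shape
    # Decision variables: (x_1, ..., x_A, v); maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every opponent column b:  v - sum_a x_a Q[a, b] <= 0.
    A_ub = np.hstack([-Q_sab.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])   # sum_a x_a = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * A + [(None, None)]                # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]


def empirical_model(sample_next_state, S, A, B, N):
    """Query the generative model N times per (s, a, b) and record frequencies."""
    P_hat = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                for _ in range(N):
                    P_hat[s, a, b, sample_next_state(s, a, b)] += 1.0 / N
    return P_hat


def nash_value_iteration(P_hat, r, gamma, iters=500):
    """Shapley value iteration in the empirical model; returns the NE value."""
    S, A, B, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P_hat @ V                  # Q has shape (S, A, B)
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V
```

Setting $N = \widetilde O\big((1-\gamma)^{-3}\epsilon^{-2}\big)$ samples per $(s,a,b)$ gives the total budget $\widetilde O\big(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2}\big)$ stated in the abstract.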
Related papers
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde O\big(|S|(|A|+|B|)(1-\gamma)^{-3}\varepsilon^{-2}\big)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning? [30.065091907118827]
This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP).
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample-efficient as well, providing a flexible approach to design model-based algorithms for reinforcement learning (a single-agent sketch of this plug-in pipeline appears after this list).
arXiv Detail & Related papers (2020-10-12T13:13:01Z)
- A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
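Two of the related works above (the plug-in solver paper and the generative-model paper) analyze the single-agent counterpart of the same plug-in recipe. As a point of contrast with the two-player sketch following the abstract, the only change in the planning step is that the matrix-game value is replaced by a maximum over the agent's own actions; the code below is an illustrative sketch, and `P_hat`, `r`, and the function name are placeholders rather than code from those papers.

```python
import numpy as np

def plug_in_value_iteration(P_hat, r, gamma, iters=500):
    """Plan in an empirical MDP: P_hat has shape (S, A, S), r has shape (S, A)."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P_hat @ V   # Q[s, a] = r(s, a) + gamma * E_{s' ~ P_hat}[V(s')]
        V = Q.max(axis=1)           # single-agent backup: max over own actions
    return V
```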