A Finite-Sample Analysis of Payoff-Based Independent Learning in
Zero-Sum Stochastic Games
- URL: http://arxiv.org/abs/2303.03100v1
- Date: Fri, 3 Mar 2023 05:01:41 GMT
- Title: A Finite-Sample Analysis of Payoff-Based Independent Learning in
Zero-Sum Stochastic Games
- Authors: Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam
Wierman
- Abstract summary: We study two-player zero-sum games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics.
The resulting dynamics are payoff-based, convergent, rational, and symmetric among players.
- Score: 22.62123576833411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study two-player zero-sum stochastic games, and propose a form of
independent learning dynamics called Doubly Smoothed Best-Response dynamics,
which integrates a discrete and doubly smoothed variant of the best-response
dynamics into temporal-difference (TD)-learning and minimax value iteration.
The resulting dynamics are payoff-based, convergent, rational, and symmetric
among players. Our main results provide finite-sample guarantees. In
particular, we prove the first-known $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample
complexity bound for payoff-based independent learning dynamics, up to a
smoothing bias. In the special case where the stochastic game has only one
state (i.e., matrix games), we provide a sharper
$\tilde{\mathcal{O}}(1/\epsilon)$ sample complexity. Our analysis uses a novel
coupled Lyapunov drift approach to capture the evolution of multiple sets of
coupled and stochastic iterates, which might be of independent interest.
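The abstract describes the dynamics only at a high level, so the following is a minimal, hypothetical sketch of what a payoff-based, doubly smoothed best-response scheme can look like in the one-state (matrix game) case: each player estimates its action payoffs from bandit feedback with a TD-style step size, forms a softmax (smoothed) best response to those estimates, and then moves its mixed strategy only a small step toward that response (the second smoothing). The function name, step-size schedules, and parameters `tau` and `beta` below are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def doubly_smoothed_br_matrix_game(A, T=20000, tau=0.05, beta=0.02, seed=0):
    """Illustrative sketch (not the paper's exact scheme) of payoff-based,
    symmetric learning in a zero-sum matrix game with payoff matrix A.
    Player 1 maximizes x^T A y, player 2 minimizes it."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    x = np.full(n, 1.0 / n)      # player 1's mixed strategy
    y = np.full(m, 1.0 / m)      # player 2's mixed strategy
    q1 = np.zeros(n)             # player 1's payoff estimates
    q2 = np.zeros(m)             # player 2's payoff estimates
    for t in range(1, T + 1):
        a = rng.choice(n, p=x)   # players sample actions independently
        b = rng.choice(m, p=y)
        r = A[a, b]              # both observe only the realized payoff
        alpha = 1.0 / t ** 0.6   # assumed TD-style step size
        q1[a] += alpha * (r - q1[a])      # payoff-based estimate updates
        q2[b] += alpha * (-r - q2[b])
        br1 = np.exp(q1 / tau); br1 /= br1.sum()   # smoothed (softmax) BRs
        br2 = np.exp(q2 / tau); br2 /= br2.sum()
        x += beta * (br1 - x)    # second smoothing: small step toward the BR
        y += beta * (br2 - y)
    return x, y

if __name__ == "__main__":
    # Matching pennies: the unique Nash equilibrium is uniform play.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    x, y = doubly_smoothed_br_matrix_game(A)
    print(x, y)  # both strategies should be close to [0.5, 0.5]
```

With the softmax temperature `tau` held fixed, the strategies can only converge to a neighborhood of the equilibrium, which mirrors the "up to a smoothing bias" qualifier in the abstract.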
Related papers
- Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games [31.554420227087043]
We develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players.
In the matrix game setting, the results imply a complexity of $O(\epsilon^{-1})$ to find the Nash distribution.
In the stochastic game setting, the results also imply a complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium.
arXiv Detail & Related papers (2024-09-02T20:07:25Z)
- Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games [18.11805544026393]
This work studies an independent natural policy gradient (NPG) algorithm for the multi-agent reinforcement learning problem in Markov potential games.
It is shown that, under mild technical assumptions, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $\epsilon$-Nash equilibrium (NE) within $\mathcal{O}(1/\epsilon)$ iterations.
arXiv Detail & Related papers (2023-10-15T04:10:44Z)
- Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards [27.209606183563853]
We establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism can be provably robustified against heavy-tailed reward distributions.
We show that a robust variant of natural actor-critic (NAC) based on TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-\frac{1}{p}})$ sample complexity.
arXiv Detail & Related papers (2023-06-20T11:12:21Z)
- Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms [66.8634598612777]
We show that a Nash equilibrium (NE) is acceptable to all players in a multi-player game, since no player can benefit from unilaterally changing its strategy.
arXiv Detail & Related papers (2023-01-19T11:36:50Z)
- On-Demand Sampling: Learning Optimally from Multiple Distributions [63.20009081099896]
Social and real-world considerations have given rise to multi-distribution learning paradigms.
We establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity.
Our algorithm design and analysis are enabled by our extensions of online learning techniques for solving zero-sum games.
arXiv Detail & Related papers (2022-10-22T19:07:26Z)
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using a number of samples that is minimax-optimal up to logarithmic factors.
We derive a refined regret bound for Follow-the-Regularized-Leader (FTRL) that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z)
- Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0<\varepsilon<1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP) from a single Markovian trajectory (see the illustrative sketch after this list).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+\frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, where $\mu_{\min}$ is the minimum state-action occupancy probability of the behavior policy and $t_{\mathrm{mix}}$ is its mixing time.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
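As a companion to the asynchronous Q-learning entry above, here is a minimal, hypothetical sketch of an asynchronous Q-learning update along a single behavior trajectory; the `env_step` interface, the uniform behavior policy, and the step-size schedule are assumptions made for illustration, not the cited paper's exact setup.

```python
import numpy as np

def async_q_learning(env_step, n_states, n_actions, gamma=0.9, T=100_000, seed=0):
    """Sketch of asynchronous Q-learning: exactly one (state, action) entry is
    updated per step along a single Markovian trajectory.  How often each pair
    is visited (mu_min) and how fast the trajectory mixes (t_mix) are what
    drive the sample complexity quoted in the entry above.
    env_step(state, action, rng) -> (reward, next_state) is an assumed
    interface, not part of the cited paper."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(T):
        a = rng.integers(n_actions)                # uniform behavior policy
        r, s_next = env_step(s, a, rng)
        visits[s, a] += 1
        alpha = 1.0 / (1.0 + visits[s, a]) ** 0.6  # per-entry step size (assumed)
        # Standard Q-learning update applied only to the visited entry.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

if __name__ == "__main__":
    # Toy 2-state MDP: action 1 tends to reach state 1, which pays reward 1.
    def env_step(s, a, rng):
        s_next = int(rng.random() < (0.9 if a == 1 else 0.1))
        return float(s_next == 1), s_next
    Q = async_q_learning(env_step, n_states=2, n_actions=2)
    print(Q)  # action 1 should have the higher value in both states
```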
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.