Related papers: Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

URL: http://arxiv.org/abs/2409.01447v2
Date: Thu, 5 Sep 2024 02:16:17 GMT
Title: Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games
Authors: Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, Adam Wierman
Abstract summary: We develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium.
Score: 31.554420227087043
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution and a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.
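
The smoothed best-response idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the paper's dynamics are payoff-based, whereas this sketch assumes both players can read the full payoff matrix. The function names, the temperature `tau`, the step size `beta`, and the matching-pennies matrix are all assumptions chosen for illustration.

```python
import math


def softmax(values, tau):
    """Smoothed (entropy-regularized) best response to a payoff vector."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - top) / tau) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_best_response_dynamics(A, x0, y0, tau=1.0, beta=0.1, steps=2000):
    """Both players move a small step toward their softmax best response.

    A[i][j] is the row player's payoff; the column player receives -A[i][j].
    Hypothetical full-information variant, used only to show the update rule.
    """
    n, m = len(A), len(A[0])
    x, y = list(x0), list(y0)
    for _ in range(steps):
        # Expected payoff of each pure action against the opponent's policy.
        u_row = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_col = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        br_x = softmax(u_row, tau)
        br_y = softmax(u_col, tau)
        # Symmetric, simultaneous update: convex combination with step beta.
        x = [(1 - beta) * xi + beta * bi for xi, bi in zip(x, br_x)]
        y = [(1 - beta) * yi + beta * bi for yi, bi in zip(y, br_y)]
    return x, y


# Matching pennies: the row player wins when the two actions match.
A = [[1.0, -1.0], [-1.0, 1.0]]
x, y = smoothed_best_response_dynamics(A, x0=[0.9, 0.1], y0=[0.2, 0.8])
print(x, y)
```

For matching pennies the fixed point of the smoothed best responses is the uniform policy for both players by symmetry, so the printed policies should both be close to `[0.5, 0.5]`. Lowering `tau` moves the Nash distribution of a general game closer to a Nash equilibrium, at the cost of a sharper softmax.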

Related papers

Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
We show that when a pure strategy Nash equilibrium exists, $c$ becomes zero, leading to an optimal instance-dependent regret bound. Our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
arXiv Detail & Related papers (2025-02-24T20:20:06Z)
Nash Equilibria via Stochastic Eigendecomposition [4.190518009892366]
We show a Nash equilibrium can be approximated purely with calls to stochastic, iterative variants of singular value decomposition and power iteration. We provide pseudocode and experiments demonstrating solving for all equilibria of a general-sum game using only readily available linear algebra tools.
arXiv Detail & Related papers (2024-11-04T17:32:21Z)
Scalable and Independent Learning of Nash Equilibrium Policies in $n$-Player Stochastic Games with Unknown Independent Chains [1.0878040851638]
We study games with independent chains and unknown transition matrices. In this class of games, players control their own internal Markov chains whose transitions do not depend on the states/actions of other players. We propose a fully decentralized mirror descent algorithm to learn an $\epsilon$-NE policy.
arXiv Detail & Related papers (2023-12-04T03:04:09Z)
Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity and Last-Iterate Convergence [19.779044926914704]
Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm.
arXiv Detail & Related papers (2023-09-08T11:47:31Z)
A Finite-Sample Analysis of Payoff-Based Independent Learning in Zero-Sum Stochastic Games [22.62123576833411]
We study two-player zero-sum games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics. The resulting dynamics are payoff-based, convergent, rational, and symmetric among players.
arXiv Detail & Related papers (2023-03-03T05:01:41Z)
Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms [66.8634598612777]
Nash equilibrium (NE) is a solution acceptable to all players in a multi-player game, since no one can benefit from unilaterally changing its strategy. We establish the general theory and algorithms for the global NE step by step.
arXiv Detail & Related papers (2023-01-19T11:36:50Z)
Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z)
On-Demand Sampling: Learning Optimally from Multiple Distributions [63.20009081099896]
Social and real-world considerations have given rise to multi-distribution learning paradigms. We establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Our algorithm design and analysis are enabled by our extensions of online learning techniques for solving zero-sum games.
arXiv Detail & Related papers (2022-10-22T19:07:26Z)
Learning Two-Player Mixture Markov Games: Kernel Function Approximation and Correlated Equilibrium [157.0902680672422]
We consider learning Nash equilibria in two-player zero-sum Markov Games with nonlinear function approximation. We propose a novel online learning algorithm to find a Nash equilibrium by minimizing the duality gap.
arXiv Detail & Related papers (2022-08-10T14:21:54Z)
Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z)
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.