On the Limitations and Possibilities of Nash Regret Minimization in Zero-Sum Matrix Games under Noisy Feedback
- URL: http://arxiv.org/abs/2306.13233v3
- Date: Sat, 24 May 2025 01:53:41 GMT
- Title: On the Limitations and Possibilities of Nash Regret Minimization in Zero-Sum Matrix Games under Noisy Feedback
- Authors: Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff
- Abstract summary: Nash regret is the difference between the player's total reward and the game's Nash equilibrium value scaled by the time horizon $T$. We show that standard algorithms incur $\Omega(\sqrt{T})$ Nash regret even when the row player receives noisy feedback on the entire matrix $A$. We present the first algorithm for general $n \times m$ matrix games that achieves instance-dependent $\text{polylog}(T)$ Nash regret.
- Score: 21.332966440675758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies a variant of two-player zero-sum matrix games, where, at each timestep, the row player selects row $i$, the column player selects column $j$, and the row player receives a noisy reward with expected value $A_{i,j}$, along with noisy feedback on the input matrix $A$. The row player's goal is to maximize their total reward against an adversarial column player. Nash regret, defined as the difference between the player's total reward and the game's Nash equilibrium value scaled by the time horizon $T$, is often used to evaluate algorithmic performance in zero-sum games. We begin by studying the limitations of existing algorithms for minimizing Nash regret. We show that standard algorithms--including Hedge, FTRL, and OMD--as well as the strategy of playing the Nash equilibrium of the empirical matrix--all incur $\Omega(\sqrt{T})$ Nash regret, even when the row player receives noisy feedback on the entire matrix $A$. Furthermore, we show that UCB for matrix games, a natural adaptation of the well-known bandit algorithm, also suffers $\Omega(\sqrt{T})$ Nash regret under bandit feedback. Notably, these lower bounds hold even in the simplest case of $2 \times 2$ matrix games, where the instance-dependent matrix parameters are constant. We next ask whether instance-dependent $\text{polylog}(T)$ Nash regret is achievable against adversarial opponents. We answer this affirmatively. In the full-information setting, we present the first algorithm for general $n \times m$ matrix games that achieves instance-dependent $\text{polylog}(T)$ Nash regret. In the bandit feedback setting, we design an algorithm with similar guarantees for the special case of $2 \times 2$ games--the same regime in which existing algorithms provably suffer $\Omega(\sqrt{T})$ regret despite the simplicity of the instance. Finally, we validate our theoretical results with empirical evidence.
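To make the protocol and the objective concrete: writing $V^* = \max_x \min_y x^\top A y$ for the value of the game, the Nash regret after $T$ rounds is $T \cdot V^* - \sum_{t=1}^{T} A_{i_t, j_t}$. Below is a minimal illustrative sketch, not the paper's algorithm or its lower-bound construction: Hedge plays a noisy $2 \times 2$ matching-pennies game (value $V^* = 0$) against a column player that best-responds to Hedge's current mixture. The matrix, noise level, step size, and opponent model are arbitrary choices made for the example.

```python
# Minimal illustration of the noisy matrix-game protocol and Nash regret.
# Hedge (multiplicative weights) vs. a best-responding column player on
# matching pennies. All constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])          # matching pennies; Nash value V* = 0
T = 10_000
eta = np.sqrt(np.log(2) / T)         # standard Hedge step size
sigma = 0.5                          # std. dev. of the reward noise

weights = np.ones(2)
total_reward = 0.0
for _ in range(T):
    p = weights / weights.sum()                   # row player's mixed strategy
    j = int(np.argmin(p @ A))                     # adversarial (best-responding) column
    i = rng.choice(2, p=p)                        # row sampled from p
    total_reward += A[i, j]                       # bookkeeping uses the expected reward A_{i,j}
    noisy_col = A[:, j] + sigma * rng.standard_normal(2)  # noisy feedback on column j
    weights *= np.exp(eta * noisy_col)            # Hedge update on observed noisy rewards

nash_regret = T * 0.0 - total_reward              # Nash regret = T*V* - total reward
print(f"Nash regret over T={T} rounds: {nash_regret:.1f} (sqrt(T) = {T**0.5:.0f})")
```

By Hedge's standard external-regret guarantee, the quantity printed here stays $O(\sqrt{T})$; the paper's lower bounds show that, for Hedge and the other listed baselines, $\Omega(\sqrt{T})$ is also unavoidable on suitably chosen instances, which is what motivates the $\text{polylog}(T)$ algorithms described in the abstract.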
Related papers
- Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
We show that when a pure strategy Nash equilibrium exists, $c$ becomes zero, leading to an optimal instance-dependent regret bound.
Our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
arXiv Detail & Related papers (2025-02-24T20:20:06Z) - Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
We show that our results allow us to risk at most an $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents.
arXiv Detail & Related papers (2025-02-17T11:04:01Z) - Blackwell's Approachability with Approximation Algorithms [15.266876140036052]
We consider a repeated vector-valued game between a player and an adversary.
In case only the player's or adversary's set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.
arXiv Detail & Related papers (2025-02-06T09:54:00Z) - Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms [71.73971094342349]
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games. We show that OMWU offers several advantages, including logarithmic dependence on the size of the payoff matrix. We prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue.
arXiv Detail & Related papers (2024-06-15T13:26:17Z) - Near-Optimal Pure Exploration in Matrix Games: A Generalization of Stochastic Bandits & Dueling Bandits [21.49682459678413]
We study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise.
We find a near-optimal algorithm whose complexity matches the lower bound up to log factors.
The problem of identifying the PSNE also generalizes pure exploration in multi-armed bandits and dueling bandits, and our result matches the optimal bounds up to log factors (a saddle-point characterization of the PSNE is sketched after this list).
arXiv Detail & Related papers (2023-10-25T00:05:37Z) - Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z) - Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z) - Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback [29.553652241608997]
We study the class of smooth and strongly monotone games and optimal no-regret learning therein.
We first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$.
Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning.
arXiv Detail & Related papers (2021-12-06T08:27:54Z) - Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z) - Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the $K$-armed dueling bandit problem in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z) - Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
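Several of the entries above concern identifying a pure strategy Nash equilibrium (PSNE). For orientation only: in a zero-sum matrix game with row-player payoff matrix $A$, a PSNE is a saddle point of $A$. The sketch below checks for one given exact knowledge of $A$; it is not the noisy-feedback, sample-efficient identification procedure those papers analyze, and the example matrix is an arbitrary choice.

```python
# Saddle-point check: (i, j) is a PSNE iff A[i, j] is the largest entry in its
# column (the row player cannot improve) and the smallest entry in its row
# (the column player, who receives -A[i, j], cannot improve). Assumes A is
# known exactly, unlike the noisy settings discussed above.
import numpy as np

def find_psne(A: np.ndarray):
    """Return indices (i, j) of a pure strategy Nash equilibrium, or None."""
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] >= A[:, j].max() and A[i, j] <= A[i, :].min():
                return i, j
    return None

# Arbitrary example matrix: entry (1, 1) is a saddle point (column max, row min).
A = np.array([[3.0, 1.0, 4.0],
              [5.0, 2.0, 6.0]])
print(find_psne(A))   # -> (1, 1)
```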