Matrix games with bandit feedback
- URL: http://arxiv.org/abs/2006.05145v2
- Date: Sat, 12 Jun 2021 10:23:31 GMT
- Title: Matrix games with bandit feedback
- Authors: Brendan O'Donoghue, Tor Lattimore, Ian Osband
- Abstract summary: We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback.
We show that Thompson sampling fails catastrophically in this setting and provide an empirical comparison with existing algorithms.
- Score: 33.637621576707076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a version of the classical zero-sum matrix game with unknown payoff
matrix and bandit feedback, where the players only observe each other's actions
and a noisy payoff. This generalizes the usual matrix game, where the payoff
matrix is known to the players. Despite numerous applications, this problem has
received relatively little attention. Although adversarial bandit algorithms
achieve low regret, they do not exploit the matrix structure and perform poorly
relative to the new algorithms. The main contributions are regret analyses of
variants of UCB and K-learning that hold for any opponent, e.g., even when the
opponent adversarially plays the best response to the learner's mixed strategy.
Along the way, we show that Thompson sampling fails catastrophically in this setting and
provide an empirical comparison with existing algorithms.
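As a rough illustration of the setting, the sketch below runs a UCB-style learner on a small zero-sum matrix game with bandit feedback: it keeps an optimistic estimate of each payoff entry and plays a maximin row of that estimate while the opponent best-responds. The 3x3 matrix, Gaussian noise, exploration bonus, and the restriction to pure maximin rows are assumptions made for this example; the paper's UCB and K-learning variants operate over mixed strategies and come with formal regret guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 zero-sum game; the true payoff matrix is unknown to the learner.
A_true = rng.uniform(-1.0, 1.0, size=(3, 3))
n_rows, n_cols = A_true.shape
T = 5000

counts = np.ones((n_rows, n_cols))   # start at 1 to avoid division by zero
means = np.zeros((n_rows, n_cols))   # empirical mean payoff per entry

for t in range(1, T + 1):
    # Optimistic payoff estimate: empirical mean plus an exploration bonus.
    A_ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)

    # Play a maximin pure row of the optimistic matrix (a simplification;
    # the paper's algorithms choose mixed strategies).
    row = int(np.argmax(A_ucb.min(axis=1)))

    # Adversarial opponent: best response to the learner's chosen row.
    col = int(np.argmin(A_true[row]))

    # Bandit feedback: a noisy observation of the selected entry only.
    payoff = A_true[row, col] + 0.1 * rng.standard_normal()
    counts[row, col] += 1
    means[row, col] += (payoff - means[row, col]) / counts[row, col]

print("estimated payoffs:\n", np.round(means, 2))
```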
Related papers
- Finite-Sample Guarantees for Best-Response Learning Dynamics in Zero-Sum Matrix Games [22.380293155135096]
We study best-response-type learning dynamics for two-player zero-sum matrix games.
We consider two settings that are distinguished by the type of information that each player has about the game and their opponent's strategy.
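For intuition, best-response-type dynamics in a zero-sum matrix game can be sketched as fictitious play, where each player best-responds to the opponent's empirical mixture of past actions. The rock-paper-scissors matrix and the simultaneous updates below are assumptions made for this illustration; the cited paper analyzes specific dynamics and information settings with finite-sample guarantees.

```python
import numpy as np

# Rock-paper-scissors payoff matrix (row player maximizes, column player minimizes).
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

T = 10_000
row_counts = np.zeros(A.shape[0])
col_counts = np.zeros(A.shape[1])
row, col = 0, 0

for _ in range(T):
    row_counts[row] += 1
    col_counts[col] += 1
    # Each player best-responds to the opponent's empirical (average) strategy so far.
    col_mix = col_counts / col_counts.sum()
    row_mix = row_counts / row_counts.sum()
    row = int(np.argmax(A @ col_mix))
    col = int(np.argmin(row_mix @ A))

print("empirical row strategy:", np.round(row_counts / T, 2))
print("empirical col strategy:", np.round(col_counts / T, 2))
```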
arXiv Detail & Related papers (2024-07-29T15:56:49Z)
- No Algorithmic Collusion in Two-Player Blindfolded Game with Thompson Sampling [10.376707874029561]
We show that when the players use Thompson sampling, the game dynamics converge to the Nash equilibrium under a mild assumption on the payoff matrices.
Algorithmic collusion therefore does not arise in this case, despite the fact that the players do not intentionally deploy competitive strategies.
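A generic sketch of Thompson sampling for one player facing Bernoulli bandit feedback in a matrix game is given below: sample a payoff matrix from a Beta posterior and best-respond to the sample. The 2x2 matrix, the priors, and the opponent's behavior are illustrative assumptions; in the cited "blindfolded" setting both players run Thompson sampling, and convergence relies on a mild assumption on the payoff matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2x2 game with Bernoulli payoffs; the matrix is unknown to the learner.
P_true = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
T = 5000

# Beta posterior parameters for each entry, from the row player's perspective.
alpha = np.ones_like(P_true)
beta = np.ones_like(P_true)

for _ in range(T):
    # Sample a plausible payoff matrix from the posterior, then play maximin on it.
    sampled = rng.beta(alpha, beta)
    row = int(np.argmax(sampled.min(axis=1)))
    col = int(np.argmin(P_true[row]))          # opponent best-responds to the row

    # Bernoulli bandit feedback for the selected entry updates its posterior.
    reward = rng.binomial(1, P_true[row, col])
    alpha[row, col] += reward
    beta[row, col] += 1 - reward

print("posterior means:\n", np.round(alpha / (alpha + beta), 2))
```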
arXiv Detail & Related papers (2024-05-23T08:21:48Z)
- No-Regret Learning in Time-Varying Zero-Sum Games [99.86860277006318]
Learning from repeated play in a fixed zero-sum game is a classic problem in game theory and online learning.
We develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under three performance measures.
Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property.
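The two-layer idea can be caricatured as a multiplicative-weights ("Hedge") meta-learner that reweights black-box base-learners by their cumulative losses, as in the sketch below. Treating the base-learners as frozen mixed strategies over three actions and drawing random losses are simplifications made for this illustration; the paper's base-learners are online algorithms satisfying a specific property.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three "base-learners", here frozen mixed strategies over 3 actions (a simplification).
base = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1/3, 1/3, 1/3]])
eta = 0.1
cum_loss = np.zeros(base.shape[0])

for _ in range(1000):
    # Meta-learner: exponential weights over base-learners by cumulative loss.
    w = np.exp(-eta * cum_loss)
    w /= w.sum()
    play = w @ base                            # mixture of the base recommendations

    loss_vec = rng.uniform(0.0, 1.0, size=3)   # per-action losses this round
    cum_loss += base @ loss_vec                # each base-learner's incurred loss

print("meta weights:", np.round(w, 3))
print("played mixture:", np.round(play, 3))
```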
arXiv Detail & Related papers (2022-01-30T06:10:04Z)
- Balancing Adaptability and Non-exploitability in Repeated Games [29.04618208068207]
We study the problem of guaranteeing low regret in repeated games against an opponent with unknown membership in one of several classes.
We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some "fair" value.
Our solution is an expert algorithm (LAFF) that searches within a set of sub-algorithms that are optimal for each opponent class and uses a punishment policy upon detecting evidence of exploitation by the opponent.
arXiv Detail & Related papers (2021-12-20T03:09:30Z)
- Online Learning in Unknown Markov Games [55.07327246187741]
We study online learning in unknown Markov games.
We show that achieving sublinear regret against the best response in hindsight is statistically hard.
We present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes.
arXiv Detail & Related papers (2020-10-28T14:52:15Z)
- Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z)
- Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
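In the matrix-game special case, the duality gap being minimized has a simple closed form: the value of the row player's best response to y minus the value of the column player's best response to x, which is zero exactly at a Nash equilibrium. The sketch below evaluates this quantity for a strategy pair; it only illustrates the objective, not the paper's Markov-game algorithm.

```python
import numpy as np

def duality_gap(A, x, y):
    """Duality gap of mixed strategies (x, y) in the zero-sum game with payoff A
    (row player maximizes x^T A y); zero iff (x, y) is a Nash equilibrium."""
    best_row_value = np.max(A @ y)    # best pure response for the row player
    best_col_value = np.min(x @ A)    # best pure response for the column player
    return best_row_value - best_col_value

A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
x = np.full(3, 1/3)
y = np.full(3, 1/3)
print(duality_gap(A, x, y))   # 0.0: the uniform pair is an equilibrium here
```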
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
- Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
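Transplanted to the unknown matrix-game setting of the main paper purely for illustration, an explore-then-exploit scheme of this flavor might spend a budget of order T^{2/3} sampling entries uniformly, then commit to a maximin strategy of the estimated matrix. The matrix size, noise level, and the commitment to a pure row below are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical unknown 4x4 zero-sum game observed through noisy bandit feedback.
A_true = rng.uniform(-1.0, 1.0, size=(4, 4))
T = 9000
T_explore = int(T ** (2 / 3))        # exploration budget behind the T^{2/3} rate

counts = np.zeros_like(A_true)
means = np.zeros_like(A_true)
for _ in range(T_explore):
    i, j = rng.integers(4), rng.integers(4)    # sample an entry uniformly at random
    counts[i, j] += 1
    obs = A_true[i, j] + 0.1 * rng.standard_normal()
    means[i, j] += (obs - means[i, j]) / counts[i, j]

# Exploit: commit to a maximin pure row of the estimated matrix for the rest of play.
row = int(np.argmax(means.min(axis=1)))
print("chosen row:", row, "estimated worst-case payoff:", round(means[row].min(), 3))
```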
arXiv Detail & Related papers (2020-02-10T18:44:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.