Matrix games with bandit feedback
- URL: http://arxiv.org/abs/2006.05145v2
- Date: Sat, 12 Jun 2021 10:23:31 GMT
- Title: Matrix games with bandit feedback
- Authors: Brendan O'Donoghue, Tor Lattimore, Ian Osband
- Abstract summary: We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback.
We show that Thompson sampling fails catastrophically in this setting and provide an empirical comparison with existing algorithms.
- Score: 33.637621576707076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a version of the classical zero-sum matrix game with unknown payoff
matrix and bandit feedback, where the players only observe each other's actions
and a noisy payoff. This generalizes the usual matrix game, where the payoff
matrix is known to the players. Despite numerous applications, this problem has
received relatively little attention. Although adversarial bandit algorithms
achieve low regret, they do not exploit the matrix structure and perform poorly
relative to the new algorithms. The main contributions are regret analyses of
variants of UCB and K-learning that hold for any opponent, e.g., even when the
opponent adversarially plays the best response to the learner's mixed strategy.
Along the way, we show that Thompson sampling fails catastrophically in this setting and
provide an empirical comparison with existing algorithms.
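As a rough illustration of the setting, the sketch below runs a UCB-style learner on a small zero-sum matrix game with bandit feedback: it keeps an optimistic estimate of each payoff entry and plays a maximin row of that estimate while the opponent best-responds. The 3x3 matrix, Gaussian noise, exploration bonus, and the restriction to pure maximin rows are assumptions made for this example; the paper's UCB and K-learning variants operate over mixed strategies and come with formal regret guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 zero-sum game; the true payoff matrix is unknown to the learner.
A_true = rng.uniform(-1.0, 1.0, size=(3, 3))
n_rows, n_cols = A_true.shape
T = 5000

counts = np.ones((n_rows, n_cols))   # start at 1 to avoid division by zero
means = np.zeros((n_rows, n_cols))   # empirical mean payoff per entry

for t in range(1, T + 1):
    # Optimistic payoff estimate: empirical mean plus an exploration bonus.
    A_ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)

    # Play a maximin pure row of the optimistic matrix (a simplification;
    # the paper's algorithms choose mixed strategies).
    row = int(np.argmax(A_ucb.min(axis=1)))

    # Adversarial opponent: best response to the learner's chosen row.
    col = int(np.argmin(A_true[row]))

    # Bandit feedback: a noisy observation of the selected entry only.
    payoff = A_true[row, col] + 0.1 * rng.standard_normal()
    counts[row, col] += 1
    means[row, col] += (payoff - means[row, col]) / counts[row, col]

print("estimated payoffs:\n", np.round(means, 2))
```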
Related papers
- Finite-Sample Guarantees for Best-Response Learning Dynamics in Zero-Sum Matrix Games [22.380293155135096]
We study best-response-type learning dynamics for two-player zero-sum matrix games.
We consider two settings that are distinguished by the type of information that each player has about the game and their opponent's strategy.
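For intuition, best-response-type dynamics in a zero-sum matrix game can be sketched as fictitious play, where each player best-responds to the opponent's empirical mixture of past actions. The rock-paper-scissors matrix and the simultaneous updates below are assumptions made for this illustration; the cited paper analyzes specific dynamics and information settings with finite-sample guarantees.

```python
import numpy as np

# Rock-paper-scissors payoff matrix (row player maximizes, column player minimizes).
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

T = 10_000
row_counts = np.zeros(A.shape[0])
col_counts = np.zeros(A.shape[1])
row, col = 0, 0

for _ in range(T):
    row_counts[row] += 1
    col_counts[col] += 1
    # Each player best-responds to the opponent's empirical (average) strategy so far.
    col_mix = col_counts / col_counts.sum()
    row_mix = row_counts / row_counts.sum()
    row = int(np.argmax(A @ col_mix))
    col = int(np.argmin(row_mix @ A))

print("empirical row strategy:", np.round(row_counts / T, 2))
print("empirical col strategy:", np.round(col_counts / T, 2))
```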
arXiv Detail & Related papers (2024-07-29T15:56:49Z)
- No Algorithmic Collusion in Two-Player Blindfolded Game with Thompson Sampling [10.376707874029561]
We show that when the players use Thompson sampling, the game dynamics converge to the Nash equilibrium under a mild assumption on the payoff matrices.
Algorithmic collusion therefore does not arise in this case, despite the fact that the players do not intentionally deploy competitive strategies.
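A generic sketch of Thompson sampling for one player facing Bernoulli bandit feedback in a matrix game is given below: sample a payoff matrix from a Beta posterior and best-respond to the sample. The 2x2 matrix, the priors, and the opponent's behavior are illustrative assumptions; in the cited "blindfolded" setting both players run Thompson sampling, and convergence relies on a mild assumption on the payoff matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2x2 game with Bernoulli payoffs; the matrix is unknown to the learner.
P_true = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
T = 5000

# Beta posterior parameters for each entry, from the row player's perspective.
alpha = np.ones_like(P_true)
beta = np.ones_like(P_true)

for _ in range(T):
    # Sample a plausible payoff matrix from the posterior, then play maximin on it.
    sampled = rng.beta(alpha, beta)
    row = int(np.argmax(sampled.min(axis=1)))
    col = int(np.argmin(P_true[row]))          # opponent best-responds to the row

    # Bernoulli bandit feedback for the selected entry updates its posterior.
    reward = rng.binomial(1, P_true[row, col])
    alpha[row, col] += reward
    beta[row, col] += 1 - reward

print("posterior means:\n", np.round(alpha / (alpha + beta), 2))
```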
arXiv Detail & Related papers (2024-05-23T08:21:48Z)
- No-Regret Learning in Time-Varying Zero-Sum Games [99.86860277006318]
Learning from repeated play in a fixed zero-sum game is a classic problem in game theory and online learning.
We develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under three performance measures.
Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property.
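The two-layer idea can be caricatured as a multiplicative-weights ("Hedge") meta-learner that reweights black-box base-learners by their cumulative losses, as in the sketch below. Treating the base-learners as frozen mixed strategies over three actions and drawing random losses are simplifications made for this illustration; the paper's base-learners are online algorithms satisfying a specific property.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three "base-learners", here frozen mixed strategies over 3 actions (a simplification).
base = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1/3, 1/3, 1/3]])
eta = 0.1
cum_loss = np.zeros(base.shape[0])

for _ in range(1000):
    # Meta-learner: exponential weights over base-learners by cumulative loss.
    w = np.exp(-eta * cum_loss)
    w /= w.sum()
    play = w @ base                            # mixture of the base recommendations

    loss_vec = rng.uniform(0.0, 1.0, size=3)   # per-action losses this round
    cum_loss += base @ loss_vec                # each base-learner's incurred loss

print("meta weights:", np.round(w, 3))
print("played mixture:", np.round(play, 3))
```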
arXiv Detail & Related papers (2022-01-30T06:10:04Z)
- Balancing Adaptability and Non-exploitability in Repeated Games [29.04618208068207]
We study the problem of guaranteeing low regret in repeated games against an opponent with unknown membership in one of several classes.
We add the constraint that our algorithm is non-exploitable, in that the opponent lacks an incentive to use an algorithm against which we cannot achieve rewards exceeding some "fair" value.
Our solution is an expert algorithm (LAFF) that searches within a set of sub-algorithms that are optimal for each opponent class and uses a punishment policy upon detecting evidence of exploitation by the opponent.
arXiv Detail & Related papers (2021-12-20T03:09:30Z)
- Online Learning in Unknown Markov Games [55.07327246187741]
We study online learning in unknown Markov games.
We show that achieving sublinear regret against the best response in hindsight is statistically hard.
We present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes.
arXiv Detail & Related papers (2020-10-28T14:52:15Z)
- Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z)
- Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
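In the matrix-game special case, the duality gap being minimized has a simple closed form: the value of the row player's best response to y minus the value of the column player's best response to x, which is zero exactly at a Nash equilibrium. The sketch below evaluates this quantity for a strategy pair; it only illustrates the objective, not the paper's Markov-game algorithm.

```python
import numpy as np

def duality_gap(A, x, y):
    """Duality gap of mixed strategies (x, y) in the zero-sum game with payoff A
    (row player maximizes x^T A y); zero iff (x, y) is a Nash equilibrium."""
    best_row_value = np.max(A @ y)    # best pure response for the row player
    best_col_value = np.min(x @ A)    # best pure response for the column player
    return best_row_value - best_col_value

A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
x = np.full(3, 1/3)
y = np.full(3, 1/3)
print(duality_gap(A, x, y))   # 0.0: the uniform pair is an equilibrium here
```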
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
- Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
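Transplanted to the unknown matrix-game setting of the main paper purely for illustration, an explore-then-exploit scheme of this flavor might spend a budget of order T^{2/3} sampling entries uniformly, then commit to a maximin strategy of the estimated matrix. The matrix size, noise level, and the commitment to a pure row below are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical unknown 4x4 zero-sum game observed through noisy bandit feedback.
A_true = rng.uniform(-1.0, 1.0, size=(4, 4))
T = 9000
T_explore = int(T ** (2 / 3))        # exploration budget behind the T^{2/3} rate

counts = np.zeros_like(A_true)
means = np.zeros_like(A_true)
for _ in range(T_explore):
    i, j = rng.integers(4), rng.integers(4)    # sample an entry uniformly at random
    counts[i, j] += 1
    obs = A_true[i, j] + 0.1 * rng.standard_normal()
    means[i, j] += (obs - means[i, j]) / counts[i, j]

# Exploit: commit to a maximin pure row of the estimated matrix for the rest of play.
row = int(np.argmax(means.min(axis=1)))
print("chosen row:", row, "estimated worst-case payoff:", round(means[row].min(), 3))
```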
arXiv Detail & Related papers (2020-02-10T18:44:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.