Logarithmic Regret for Matrix Games against an Adversary with Noisy
Bandit Feedback
- URL: http://arxiv.org/abs/2306.13233v2
- Date: Tue, 24 Oct 2023 22:51:29 GMT
- Title: Logarithmic Regret for Matrix Games against an Adversary with Noisy
Bandit Feedback
- Authors: Arnab Maiti, Kevin Jamieson, Lillian J. Ratliff
- Abstract summary: This paper considers a variant of zero-sum matrix games where at each timestep the row player chooses row $i$, the column player chooses column $j$, and the row player receives a noisy reward with mean $A_{i,j}$.
The objective of the row player is to accumulate as much reward as possible, even against an adversarial column player.
If the row player uses the EXP3 strategy, an algorithm known for obtaining $\sqrt{T}$ regret against an arbitrary sequence of rewards, it is immediate that the row player also achieves $\sqrt{T}$ regret relative to the Nash equilibrium in this game setting.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers a variant of zero-sum matrix games where at each
timestep the row player chooses row $i$, the column player chooses column $j$,
and the row player receives a noisy reward with mean $A_{i,j}$. The objective
of the row player is to accumulate as much reward as possible, even against an
adversarial column player. If the row player uses the EXP3 strategy, an
algorithm known for obtaining $\sqrt{T}$ regret against an arbitrary sequence
of rewards, it is immediate that the row player also achieves $\sqrt{T}$ regret
relative to the Nash equilibrium in this game setting. However, partly
motivated by the fact that the EXP3 strategy is myopic to the structure of the
game, O'Donoghue et al. (2021) proposed a UCB-style algorithm that leverages
the game structure and demonstrated that this algorithm greatly outperforms
EXP3 empirically. While they showed that this UCB-style algorithm achieved
$\sqrt{T}$ regret, in this paper we ask if there exists an algorithm that
provably achieves $\text{polylog}(T)$ regret against any adversary, analogous
to results from stochastic bandits. We propose a novel algorithm that answers
this question in the affirmative for the simple $2 \times 2$ setting, providing
the first instance-dependent guarantees for games in the regret setting. Our
algorithm overcomes two major hurdles: 1) obtaining logarithmic regret even
though the Nash equilibrium is estimable only at a $1/\sqrt{T}$ rate, and 2)
designing row-player strategies that guarantee that either the adversary
provides information about the Nash equilibrium, or the row player incurs
negative regret. Moreover, in the full information case we address the general
$n \times m$ case where the first hurdle is still relevant. Finally, we show
that EXP3 and the UCB-based algorithm necessarily cannot perform better than
$\sqrt{T}$.
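
To make the EXP3 baseline concrete, here is a minimal sketch (not the paper's proposed algorithm) of the classic EXP3 strategy run by the row player under bandit feedback. The payoff matrix, the Gaussian noise model, and the best-response adversary below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def exp3_row_player(A, T, noise=0.1, rng=None):
    """Classic EXP3 for the row player of a noisy matrix game (sketch).

    A     : (n, m) matrix of mean rewards for the row player, entries in [0, 1].
    T     : horizon.
    noise : std. dev. of the Gaussian observation noise (assumed model).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, _ = A.shape
    # Exploration rate from the classic EXP3 analysis.
    gamma = min(1.0, np.sqrt(n * np.log(n) / ((np.e - 1.0) * T)))

    weights = np.ones(n)
    total_reward = 0.0
    for _ in range(T):
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / n
        i = rng.choice(n, p=probs)

        # Hypothetical adversary: play the column that minimizes the row
        # player's expected reward under its current mixed strategy.
        j = int(np.argmin(probs @ A))

        # Bandit feedback: only a noisy sample with mean A[i, j] is observed.
        r = float(np.clip(A[i, j] + noise * rng.standard_normal(), 0.0, 1.0))
        total_reward += r

        # Importance-weighted reward estimate and multiplicative update.
        r_hat = np.zeros(n)
        r_hat[i] = r / probs[i]
        weights *= np.exp(gamma * r_hat / n)
        weights /= weights.max()  # rescaling for numerical stability

    return total_reward
```

Because EXP3 treats the observed rewards as an arbitrary sequence and ignores the matrix structure, its guarantee (and, as the paper shows, its actual performance) cannot go below $\sqrt{T}$ regret relative to the Nash value.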
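
The benchmark in the regret definition above is the Nash (maximin) value of the game. For the $2 \times 2$ case the paper focuses on, that value is the maximum over the row player's mixing probability $p$ of the lower envelope of two lines, so it can be found by checking the endpoints and the crossing point. The sketch below uses a made-up payoff matrix purely for illustration.

```python
import numpy as np

def nash_value_2x2(A):
    """Maximin (Nash) value of a 2x2 zero-sum game for the row player.

    V* = max_{p in [0, 1]} min_j ( p * A[0, j] + (1 - p) * A[1, j] ).
    The inner minimum is the lower envelope of two lines in p, so the
    maximum is attained at an endpoint or at the lines' crossing point.
    """
    def envelope(p):
        # Expected reward against each column, then the worst case over columns.
        return float(np.min(np.array([p, 1.0 - p]) @ A))

    candidates = [0.0, 1.0]
    denom = (A[0, 0] - A[1, 0]) - (A[0, 1] - A[1, 1])
    if abs(denom) > 1e-12:
        p_cross = (A[1, 1] - A[1, 0]) / denom
        if 0.0 <= p_cross <= 1.0:
            candidates.append(p_cross)

    p_star = max(candidates, key=envelope)
    return p_star, envelope(p_star)

# Toy instance (illustrative numbers only).
A = np.array([[0.8, 0.2],
              [0.3, 0.6]])
p_star, v_star = nash_value_2x2(A)
print(f"row 0 is played w.p. {p_star:.3f}; game value V* = {v_star:.3f}")
```

A natural regret measure then compares the row player's accumulated reward to $T \cdot V^*$; the difficulty highlighted in the abstract is that $V^*$ itself can only be estimated at a $1/\sqrt{T}$ rate from noisy bandit observations.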
Related papers
- Context-lumpable stochastic bandits (arXiv, 2023-06-22)
  We consider a contextual bandit problem with $S$ contexts and $K$ actions.
  We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
  In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
- Near-Optimal Learning of Extensive-Form Games with Imperfect Information (arXiv, 2022-02-03)
  We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
  This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
- Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback (arXiv, 2021-12-06)
  We study the class of smooth and strongly monotone games and study optimal no-regret learning therein.
  We first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$.
  Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning.
- Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation (arXiv, 2021-02-15)
  We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
  We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
  We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
- Impact of Representation Learning in Linear Bandits (arXiv, 2020-10-13)
  We study how representation learning can improve the efficiency of bandit problems.
  We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
- Near-Optimal Reinforcement Learning with Self-Play (arXiv, 2020-06-22)
  We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision.
  We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
- Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards (arXiv, 2020-04-14)
  We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
  In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$.
- Regret Minimization in Stochastic Contextual Dueling Bandits (arXiv, 2020-02-20)
  We consider the $K$-armed dueling bandit problem in the contextual setting.
  We present two algorithms for the setup with respective regret guarantees.