Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret
- URL: http://arxiv.org/abs/2602.06348v1
- Date: Fri, 06 Feb 2026 03:26:01 GMT
- Title: Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret
- Authors: Shinji Ito, Haipeng Luo, Arnab Maiti, Taira Tsuchiya, Yue Wu
- Abstract summary: Learning to play zero-sum games is a fundamental problem in game theory and machine learning. We investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy. We show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$.
- Score: 64.73231630190121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play setting or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback, where only the payoff of the realized action is observable. In such challenging settings, it is well known that $\Omega(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy -- a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent's action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains another regret bound of the form $O(c' \log T)$ for a different game-dependent parameter $c'$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed settings respectively, and establishing similar game-dependent logarithmic regret bounds.
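For intuition, here is a minimal sketch of a Tsallis-INF learner in the uninformed setting: the 1/2-Tsallis-regularized FTRL distribution is computed by bisection on its normalizer, an action is sampled, and only the realized loss feeds an importance-weighted estimate. The payoff matrix, opponent schedule, and learning-rate choice below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x3 zero-sum game; entries are the row player's LOSS in [0, 1].
A = np.array([[0.2, 0.8, 0.5],
              [0.6, 0.1, 0.4]])

def tsallis_inf_probs(L_hat, eta, iters=100):
    """Sampling distribution of Tsallis-INF with the 1/2-Tsallis regularizer:
    p_i = 1 / (eta * L_hat_i + lam)^2, with lam tuned so that sum(p) = 1."""
    total = lambda lam: np.sum(1.0 / (eta * L_hat + lam) ** 2)
    lo = -eta * L_hat.min() + 1e-12          # keeps every denominator positive
    hi = lo + 1.0
    while total(hi) > 1.0:                   # grow hi until total mass drops below 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):                   # bisection on the normalizer lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    p = 1.0 / (eta * L_hat + hi) ** 2
    return p / p.sum()

T = 10_000
L_hat = np.zeros(A.shape[0])                 # importance-weighted cumulative losses
for t in range(1, T + 1):
    p = tsallis_inf_probs(L_hat, eta=1.0 / np.sqrt(t))  # illustrative schedule
    a = rng.choice(A.shape[0], p=p)
    col = t % A.shape[1]                     # an illustrative adversarial column sequence
    loss = A[a, col]                         # uninformed feedback: only this value is seen
    L_hat[a] += loss / p[a]                  # unbiased importance-weighted estimate
```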
Related papers
- Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
We show that when a pure strategy Nash equilibrium exists, $c$ becomes zero, leading to an optimal instance-dependent regret bound. Our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
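For concreteness, the benchmark behind both papers is the best fixed row action, and a pure strategy Nash equilibrium in a zero-sum matrix game is exactly a saddle point, i.e., the pure maximin and minimax values coincide. A small illustrative check with a made-up payoff matrix:

```python
import numpy as np

# Hypothetical row-player payoff matrix (row maximizes, column minimizes).
A = np.array([[3, 1, 2],
              [2, 2, 2],   # row 1 guarantees payoff 2 regardless of the column
              [1, 0, 4]])

maximin = A.min(axis=1).max()  # best guaranteed payoff of a pure row action
minimax = A.max(axis=0).min()  # best cap the column player achieves with pure actions

# A pure strategy Nash equilibrium (saddle point) exists iff the two coincide.
print(maximin, minimax, maximin == minimax)  # here: 2 2 True
```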
arXiv Detail & Related papers (2025-02-24T20:20:06Z)
- Corrupted Learning Dynamics in Games [62.73758165845971]
An equilibrium can be computed at a fast rate of $O(\log T)$ when all players follow optimistic follow-the-regularized-leader (OFTRL). We present corrupted learning dynamics that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm.
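For intuition, OFTRL with an entropic regularizer reduces to optimistic Hedge: weights are formed from the cumulative loss plus a one-step prediction of the next loss (commonly the most recent loss vector). A minimal single-player sketch, with an illustrative fixed learning rate:

```python
import numpy as np

def optimistic_hedge_step(cum_loss, predicted_loss, eta):
    """One OFTRL update with an entropic regularizer (optimistic Hedge):
    weigh actions by the cumulative loss plus a prediction of the next loss."""
    logits = -eta * (cum_loss + predicted_loss)
    logits -= logits.max()          # numerical stabilization
    w = np.exp(logits)
    return w / w.sum()

# Illustrative usage: predict the next loss by the most recent loss vector.
cum_loss, last_loss, eta = np.zeros(3), np.zeros(3), 0.1
for loss in [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]:
    p = optimistic_hedge_step(cum_loss, last_loss, eta)
    cum_loss += loss
    last_loss = loss
```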
arXiv Detail & Related papers (2024-12-10T02:23:44Z)
- Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning [105.7510838453122]
We examine online safe multi-agent reinforcement learning using constrained Markov games.
We develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem.
Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via projected gradient step.
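Schematically, one iteration of such a primal-dual scheme can be sketched as below; the gradient and constraint-violation oracles are placeholders standing in for the paper's constrained-Markov-game machinery, and the step sizes and dual cap are illustrative assumptions.

```python
import numpy as np

def primal_dual_step(p, lam, primal_grad, constraint_violation,
                     eta_p=0.1, eta_d=0.1, lam_max=10.0):
    """One generic Lagrangian iteration: online mirror descent (entropic,
    i.e. multiplicative weights) on the primal simplex variable, and a
    projected gradient ascent step on the nonnegative dual variable."""
    # Mirror-descent step on the simplex.
    p = p * np.exp(-eta_p * primal_grad)
    p /= p.sum()
    # Projected gradient step on the Lagrange multiplier, kept in [0, lam_max].
    lam = np.clip(lam + eta_d * constraint_violation, 0.0, lam_max)
    return p, lam
```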
arXiv Detail & Related papers (2023-05-31T22:09:24Z)
- Is Learning in Games Good for the Learners? [14.781100349601587]
We consider tradeoffs between reward and regret in repeated gameplay between two agents.
We show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents.
We also consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown.
arXiv Detail & Related papers (2023-05-31T02:10:27Z)
- Achieving Better Regret against Strategic Adversaries [15.51709428653595]
We study online learning problems in which the learner has extra knowledge about the adversary's behaviour.
We propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR).
AFTRL achieves $O(1)$ external regret or $O(1)$ forward regret against a no-external-regret adversary.
arXiv Detail & Related papers (2023-02-13T19:34:36Z)
- No-regret learning for repeated non-cooperative games with lossy bandits [5.437709019435926]
We study the asynchronous online learning strategies of the players, who adaptively adjust their next actions to minimize the long-term regret.
The paper provides a novel no-regret learning algorithm, called Online Gradient Descent with lossy bandits (OGD-lb).
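As a rough illustration of the idea (not OGD-lb itself), the sketch below combines a standard one-point bandit gradient estimate with a Bernoulli feedback-loss model: when the observation is dropped, the iterate is left unchanged, and when it arrives, the estimate is reweighted by the arrival probability $q$ to compensate.

```python
import numpy as np

rng = np.random.default_rng(0)

def ogd_lossy_step(x, cost_fn, q, eta, delta, lo=-1.0, hi=1.0):
    """One projected-OGD step under lossy bandit feedback (illustrative).
    q is the probability that the single-point observation actually arrives."""
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                      # random unit direction
    if rng.random() > q:                        # feedback lost this round:
        return x                                # no update is possible
    value = cost_fn(x + delta * u)              # the one bandit observation
    grad_est = (d / delta) * value * u / q      # one-point estimate, 1/q-reweighted
    return np.clip(x - eta * grad_est, lo, hi)  # projected gradient step

# Illustrative usage on a toy quadratic cost.
x = np.array([0.5, -0.3])
for t in range(1, 1001):
    x = ogd_lossy_step(x, lambda z: np.sum(z ** 2), q=0.7,
                       eta=0.5 / np.sqrt(t), delta=0.1)
```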
arXiv Detail & Related papers (2022-05-14T05:02:56Z)
- Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that, w.h.p., achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
- Generalized Bandit Regret Minimizer Framework in Imperfect Information Extensive-Form Game [9.933208900617174]
We consider the problem in the interactive bandit-feedback setting where we don't know the dynamics of the IIEG.
To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret.
We present a novel method, SIX-OMD, to learn an approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(\dots)$.
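The generic bandit-feedback device behind such gradient estimates is importance weighting; in the simpler normal-form notation (an illustration of the idea, not the paper's IIEG estimator), sampling $a^t \sim p^t$ and observing only $\ell^t(a^t)$ gives the unbiased estimate

$$\hat\ell^t_a = \frac{\mathbb{1}[a^t = a]}{p^t_{a^t}}\,\ell^t(a^t), \qquad \mathbb{E}_{a^t \sim p^t}\big[\hat\ell^t_a\big] = \sum_{b} p^t_b \cdot \frac{\mathbb{1}[b = a]}{p^t_b}\,\ell^t(b) = \ell^t(a).$$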
arXiv Detail & Related papers (2022-03-11T13:45:42Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
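A standard way to get $O(\log T)$ switching cost is a doubling schedule: recompute the policy only when the round index (or a visit count) doubles. A toy sketch of just the counting argument, with the policy computation elided:

```python
import math

T = 100_000
switches = 0
next_update = 1
for t in range(1, T + 1):
    if t >= next_update:       # recompute the policy only at doubling times
        switches += 1          # (the policy computation itself is elided here)
        next_update = 2 * t
print(switches, "<=", math.ceil(math.log2(T)) + 1)  # O(log T) switches
```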
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- Online Learning in Unknown Markov Games [55.07327246187741]
We study online learning in unknown Markov games.
We show that achieving sublinear regret against the best response in hindsight is statistically hard.
We present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes.
arXiv Detail & Related papers (2020-10-28T14:52:15Z)