Achieving Better Regret against Strategic Adversaries
- URL: http://arxiv.org/abs/2302.06652v1
- Date: Mon, 13 Feb 2023 19:34:36 GMT
- Title: Achieving Better Regret against Strategic Adversaries
- Authors: Le Cong Dinh, Tri-Dung Nguyen, Alain Zemkoho and Long Tran-Thanh
- Abstract summary: We study online learning problems in which the learner has extra knowledge about the adversary's behaviour.
We propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR).
AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward regret} against a no-external regret adversary.
- Score: 15.51709428653595
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study online learning problems in which the learner has extra knowledge
about the adversary's behaviour, i.e., in game-theoretic settings where
opponents typically follow some no-external regret learning algorithms. Under
this assumption, we propose two new online learning algorithms, Accurate Follow
the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR), that
intensively exploit this extra knowledge while maintaining the no-regret
property in the worst-case scenario of having inaccurate extra information.
Specifically, AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward
regret} against a no-external regret adversary, in comparison with the $O(\sqrt{T})$
\emph{dynamic regret} of Prod-BR. To the best of our knowledge, our algorithm
is the first to consider forward regret and to achieve $O(1)$ regret against
strategic adversaries. When playing zero-sum games with Accurate Multiplicative
Weights Update (AMWU), a special case of AFTRL, we achieve \emph{last round
convergence} to the Nash Equilibrium. We also provide numerical experiments to
further support our theoretical results. In particular, we demonstrate that our
methods achieve significantly better regret bounds and rate of last round
convergence, compared to the state of the art (e.g., Multiplicative Weights
Update (MWU) and its optimistic counterpart, OMWU).
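The abstract compares against Multiplicative Weights Update (MWU) and its optimistic variant (OMWU); the proposed AFTRL, Prod-BR and AMWU algorithms themselves are not specified in this listing. As a point of reference only, below is a minimal sketch of the standard MWU and one-line OMWU updates; the function names, step size and loss vectors are illustrative and not taken from the paper.

```python
import numpy as np

def mwu_update(weights, loss, eta=0.1):
    """Standard Multiplicative Weights Update: exponentially down-weight
    each action by its most recent loss, then renormalize."""
    new_w = weights * np.exp(-eta * loss)
    return new_w / new_w.sum()

def omwu_update(weights, loss, prev_loss, eta=0.1):
    """Optimistic MWU in its one-line form: the previous loss vector acts as a
    prediction of the next one, i.e. x_{t+1,i} is proportional to
    x_{t,i} * exp(-eta * (2*l_{t,i} - l_{t-1,i}))."""
    new_w = weights * np.exp(-eta * (2.0 * loss - prev_loss))
    return new_w / new_w.sum()

# Illustrative round: uniform start over 3 actions, one step of each update.
w = np.ones(3) / 3
loss_prev = np.array([0.3, 0.4, 0.3])
loss_now = np.array([0.2, 0.5, 0.1])
w_mwu = mwu_update(w, loss_now)
w_omwu = omwu_update(w, loss_now, loss_prev)
```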
Related papers
- Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
We show that our results allow us to guarantee at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents.
arXiv Detail & Related papers (2025-02-17T11:04:01Z) - Is Learning in Games Good for the Learners? [14.781100349601587]
We consider tradeoffs between reward and regret in repeated gameplay between two agents.
We show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents.
We also consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown.
arXiv Detail & Related papers (2023-05-31T02:10:27Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime.
This is the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Near-Optimal $\Phi$-Regret Learning in Extensive-Form Games [85.78272987312343]
We establish efficient and uncoupled learning dynamics so that the trigger regret of each player grows as $O(\log T)$ after $T$ repetitions of play.
This improves exponentially over the prior best known trigger-regret bound of $O(T^{1/4})$.
arXiv Detail & Related papers (2022-08-20T20:48:58Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - Near-Optimal No-Regret Learning for Correlated Equilibria in
Multi-Player General-Sum Games [104.74734408204749]
We show that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\mathrm{polylog}(T))$ after $T$ repetitions of the game.
We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium.
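For context, external, internal, and swap regret differ only in the class of deviations the learner is compared against. The following standard definitions, written here for the full-information normal-form case with action set $A$ and utilities $u_t$, are a sketch for orientation and are not quoted from the paper's abstract.

```latex
% External regret: compare to the best fixed action in hindsight.
\mathrm{Reg}^{\mathrm{ext}}_T = \max_{a^\ast \in A} \sum_{t=1}^{T} \bigl( u_t(a^\ast) - u_t(a_t) \bigr)

% Swap regret: compare to the best action-transformation rule in hindsight;
% internal regret is the special case where \phi modifies only a single action.
\mathrm{Reg}^{\mathrm{swap}}_T = \max_{\phi\colon A \to A} \sum_{t=1}^{T} \bigl( u_t(\phi(a_t)) - u_t(a_t) \bigr)
```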
arXiv Detail & Related papers (2021-11-11T01:19:53Z) - Evolutionary Dynamics and $\Phi$-Regret Minimization in Games [38.00008966802513]
Regret has been established as a foundational concept in online learning, and likewise has important applications in the analysis of learning dynamics in games.
In this paper, we revisit our understanding of regret from the perspective of deviations over partitions of the full \emph{mixed} strategy space.
We prove here that the well-studied evolutionary learning algorithm of replicator dynamics (RD) seamlessly minimizes the strongest possible form of $\Phi$-regret in generic $2 \times 2$ games.
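For reference, the replicator dynamics named above take the following standard single-population form over a mixed strategy $x$ on the simplex; the payoff matrix $A$ is a generic placeholder rather than a quantity defined in the abstract.

```latex
% Replicator dynamics: the share of action i grows in proportion to its
% payoff advantage over the current population average.
\dot{x}_i = x_i \bigl( (A x)_i - x^{\top} A x \bigr), \qquad i = 1, \dots, n
```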
arXiv Detail & Related papers (2021-06-28T12:48:15Z) - Model-Free Online Learning in Unknown Sequential Decision Making
Problems and Games [114.90723492840499]
In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium.
We formalize an online learning setting in which the strategy space is not known to the agent.
We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment.
arXiv Detail & Related papers (2021-03-08T04:03:24Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
arXiv Detail & Related papers (2020-02-10T18:44:50Z)