Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms
- URL: http://arxiv.org/abs/2411.00707v1
- Date: Fri, 01 Nov 2024 16:17:27 GMT
- Title: Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms
- Authors: Thanh Nguyen-Tang, Raman Arora
- Abstract summary: We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent.
We focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies, in hindsight.
- Score: 24.928268203611047
- License:
- Abstract: We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
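To make the objective concrete, here is a minimal formalization of policy regret against an $m$-memory-bounded adversary; the notation ($V$, $f_t$, $m$) is ours and only illustrative, not the paper's. Unlike external regret, the counterfactual comparator also changes the adversary's behavior, because the adversary reacts to the learner's actual past policies:

$$ \mathrm{PolicyRegret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} V\bigl(\pi,\; f_t(\pi, \dots, \pi)\bigr) \;-\; \sum_{t=1}^{T} V\bigl(\pi_t,\; f_t(\pi_{t-m+1}, \dots, \pi_t)\bigr), $$

where $\pi_t$ is the learner's policy in episode $t$, $f_t$ maps the learner's last $m$ policies to the adversary's policy, and $V(\pi, \nu)$ is the learner's expected return when the two sides play $(\pi, \nu)$. External regret would instead keep the adversary's realized policies fixed in the comparator term, which is why it under-counts the cost of facing an adaptive opponent.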
Related papers
- Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness [11.722685584919757]
Adversarial training aims to defend against malicious opponents.
Our approach uses knowledge or beliefs regarding the opponent's possible incentives as inductive bias for learning.
We conduct a series of experiments that show how even mild knowledge regarding the adversary's incentives can be useful.
arXiv Detail & Related papers (2024-06-17T12:20:59Z) - Finding mixed-strategy equilibria of continuous-action games without
gradients using randomized policy networks [83.28949556413717]
We study the problem of computing an approximate Nash equilibrium of a continuous-action game without access to gradients.
We model players' strategies using artificial neural networks.
This paper is the first to solve general continuous-action games with unrestricted mixed strategies and without any gradient information.
arXiv Detail & Related papers (2022-11-29T05:16:41Z) - Regret Minimization and Convergence to Equilibria in General-sum Markov
Games [57.568118148036376]
We present the first algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
arXiv Detail & Related papers (2022-07-28T16:27:59Z) - Learning what to remember [9.108546206438218]
We consider a lifelong learning scenario in which a learner faces a never-ending stream of facts and has to decide which ones to retain in its limited memory.
We introduce a mathematical model based on the online learning framework, in which the learner measures itself against a collection of experts that are also memory-constrained.
We identify difficulties with using the multiplicative weights update algorithm in this memory-constrained scenario, and design an alternative scheme whose regret guarantees are close to the best possible (a sketch of the standard update appears after this list).
arXiv Detail & Related papers (2022-01-11T06:42:50Z) - Inverse Reinforcement Learning for Strategy Identification [2.6572330982240935]
In adversarial environments, one side could gain an advantage by identifying the opponent's strategy.
This paper proposes to use inverse reinforcement learning (IRL) to identify strategies in adversarial environments.
arXiv Detail & Related papers (2021-07-31T17:22:52Z) - Disturbing Reinforcement Learning Agents with Corrupted Rewards [62.997667081978825]
We analyze the effects of different attack strategies based on reward perturbations on reinforcement learning algorithms.
We show that smoothly crafted adversarial rewards can mislead the learner, and that using low exploration probability values makes the learned policy more robust to corrupted rewards.
arXiv Detail & Related papers (2021-02-12T15:53:48Z) - Online Learning in Unknown Markov Games [55.07327246187741]
We study online learning in unknown Markov games.
We show that achieving sublinear regret against the best response in hindsight is statistically hard.
We present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes.
arXiv Detail & Related papers (2020-10-28T14:52:15Z) - Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z) - Efficient Use of heuristics for accelerating XCS-based Policy Learning
in Markov Games [9.038065438586065]
In games, playing against non-stationary opponents with learning ability is still challenging for reinforcement learning agents.
This paper proposes efficient use of rough heuristics to speed up policy learning when playing against concurrent learners.
arXiv Detail & Related papers (2020-05-26T07:47:27Z) - Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
arXiv Detail & Related papers (2020-02-10T18:44:50Z)