Online Learning in Unknown Markov Games
- URL: http://arxiv.org/abs/2010.15020v2
- Date: Sat, 6 Feb 2021 05:24:25 GMT
- Title: Online Learning in Unknown Markov Games
- Authors: Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra
- Abstract summary: We study online learning in unknown Markov games.
We show that achieving sublinear regret against the best response in hindsight is statistically hard.
We present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study online learning in unknown Markov games, a problem that arises in
episodic multi-agent reinforcement learning where the actions of the opponents
are unobservable. We show that in this challenging setting, achieving sublinear
regret against the best response in hindsight is statistically hard. We then
consider a weaker notion of regret by competing with the \emph{minimax value}
of the game, and present an algorithm that achieves a sublinear
$\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes. This is the first
sublinear regret bound (to our knowledge) for online learning in unknown Markov
games. Importantly, our regret bound is independent of the size of the
opponents' action spaces. As a result, even when the opponents' actions are
fully observable, our regret bound improves upon existing analysis (e.g., (Xie
et al., 2020)) by an exponential factor in the number of opponents.
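For concreteness, the weaker notion compares the learner's return to the minimax value of the game. A minimal rendering under assumed notation ($V^{*}$ for the minimax value, $V^{\pi_k}$ for the value of the learner's policy in episode $k$; neither symbol is taken from the paper):

```latex
% Regret against the minimax value over K episodes (illustrative notation):
% V^* is the game's minimax value and V^{\pi_k} the value attained by the
% learner's policy in episode k; the paper bounds this sum sublinearly.
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \bigl( V^{*} - V^{\pi_k} \bigr)
\;\le\; \tilde{\mathcal{O}}\!\bigl( K^{2/3} \bigr).
```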
Related papers
- Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms
We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent.
We focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies.
arXiv Detail & Related papers (2024-11-01T16:17:27Z)
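As a rough illustration of why policy regret is counterfactual, compare it with standard external regret. A minimal sketch under assumed notation ($u$ a payoff function, $a_t$ the opponent's realized action, $f_t$ the adaptive opponent's response to the learner's history of play; none of this notation is taken from the paper):

```latex
% External regret: compare to the best fixed policy against the opponent's
% realized actions a_t (illustrative notation):
\mathrm{Reg}_{\mathrm{ext}}(T) \;=\; \max_{\pi} \sum_{t=1}^{T} u\bigl(\pi, a_t\bigr)
  \;-\; \sum_{t=1}^{T} u\bigl(\pi_t, a_t\bigr).
% Policy regret: the comparator is replayed against the adaptive opponent,
% whose counterfactual responses f_t(\pi, \dots, \pi) may differ from a_t:
\mathrm{Reg}_{\mathrm{pol}}(T) \;=\; \max_{\pi} \sum_{t=1}^{T}
  u\bigl(\pi, f_t(\pi, \dots, \pi)\bigr) \;-\; \sum_{t=1}^{T} u\bigl(\pi_t, a_t\bigr).
```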
- Is Learning in Games Good for the Learners?
We consider tradeoffs between reward and regret in repeated gameplay between two agents.
We show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents.
We also consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown.
arXiv Detail & Related papers (2023-05-31T02:10:27Z)
- Online Learning in Stackelberg Games with an Omniscient Follower
We study the problem of online learning in a two-player decentralized cooperative Stackelberg game.
In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader's move.
We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically.
arXiv Detail & Related papers (2023-01-27T03:35:10Z)
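To make the round structure concrete, here is a minimal Python sketch of one leader-follower round with an omniscient follower; the policy callable, reward function, and action set are placeholders, not the paper's construction:

```python
# Hypothetical sketch of one round of the cooperative Stackelberg protocol.
# The leader commits first; the omniscient follower observes the leader's
# action and best-responds using full knowledge of the shared reward.
def play_round(leader_policy, reward, follower_actions):
    a_leader = leader_policy()                          # leader moves first
    a_follower = max(follower_actions,                  # follower best-responds
                     key=lambda b: reward(a_leader, b))
    return reward(a_leader, a_follower)                 # shared cooperative reward
```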
- No-Regret Learning in Time-Varying Zero-Sum Games
Learning from repeated play in a fixed zero-sum game is a classic problem in game theory and online learning.
We develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under three performance measures.
Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property.
arXiv Detail & Related papers (2022-01-30T06:10:04Z)
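The two-layer structure is a common pattern in online learning. A minimal sketch, assuming multiplicative-weights aggregation over generic black-box base-learners with losses in [0, 1] (the specific base-learner property exploited in the paper is not reproduced here):

```python
import math
import random

# Two-layer no-regret sketch: a multiplicative-weights meta-learner picks
# among black-box base-learners; losses are assumed to lie in [0, 1].
class MetaLearner:
    def __init__(self, base_learners, eta=0.1):
        self.bases = base_learners
        self.eta = eta
        self.weights = [1.0] * len(base_learners)

    def act(self):
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        chosen = random.choices(range(len(self.bases)), probs)[0]
        return self.bases[chosen].act()

    def update(self, losses):
        # losses[i] is the loss the i-th base-learner's action would incur.
        for i, loss in enumerate(losses):
            self.weights[i] *= math.exp(-self.eta * loss)
            self.bases[i].update(loss)
```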
- Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games
In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium.
We formalize an online learning setting in which the strategy space is not known to the agent.
We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment.
arXiv Detail & Related papers (2021-03-08T04:03:24Z)
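CFR-style methods are built from regret matching at each decision point. A minimal, self-contained regret-matching sketch (the per-node building block of CFR, not the paper's algorithm for unknown strategy spaces):

```python
# Regret matching: play proportionally to positive cumulative regrets,
# falling back to uniform when no action has positive regret.
class RegretMatcher:
    def __init__(self, n_actions):
        self.cum_regret = [0.0] * n_actions

    def strategy(self):
        pos = [max(r, 0.0) for r in self.cum_regret]
        total = sum(pos)
        if total == 0.0:
            return [1.0 / len(pos)] * len(pos)
        return [p / total for p in pos]

    def update(self, action_utils):
        # action_utils[a] is the counterfactual utility of action a.
        sigma = self.strategy()
        ev = sum(p * u for p, u in zip(sigma, action_utils))
        for a, u in enumerate(action_utils):
            self.cum_regret[a] += u - ev
```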
- Learning to Play Sequential Games versus Unknown Opponents
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's responses.
arXiv Detail & Related papers (2020-07-10T09:33:05Z)
- Matrix games with bandit feedback
We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback.
We show that Thompson sampling fails catastrophically in this setting and provide an empirical comparison to existing algorithms.
arXiv Detail & Related papers (2020-06-09T09:36:21Z)
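Under bandit feedback only the payoff of the realized joint action is observed, never the matrix itself. A minimal interaction-loop sketch with EXP3-style players (an illustrative choice of learner, payoffs assumed in [0, 1]; the paper's focus is Thompson sampling and comparisons to it):

```python
import math
import random

# EXP3-style bandit learner (illustrative; uniform-mixing omitted for brevity).
class Exp3:
    def __init__(self, n, eta=0.05):
        self.n, self.eta = n, eta
        self.w = [1.0] * n

    def act(self):
        total = sum(self.w)
        self.p = [w / total for w in self.w]
        self.last = random.choices(range(self.n), self.p)[0]
        return self.last

    def observe(self, reward):
        est = reward / self.p[self.last]     # importance-weighted estimate
        self.w[self.last] *= math.exp(self.eta * est / self.n)

# Zero-sum matrix game under bandit feedback: only A[i][j] is revealed.
def run(A, T):
    row, col = Exp3(len(A)), Exp3(len(A[0]))
    for _ in range(T):
        i, j = row.act(), col.act()
        row.observe(A[i][j])                 # row player maximizes
        col.observe(1.0 - A[i][j])           # column player minimizes
```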
- Provable Self-Play Algorithms for Competitive Reinforcement Learning
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
arXiv Detail & Related papers (2020-02-10T18:44:50Z)
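The explore-then-exploit pattern splits the budget into an estimation phase and an exploitation phase. A structural sketch only, with every subroutine supplied by the caller (none of these helpers come from the paper):

```python
# Schematic explore-then-exploit for an episodic zero-sum Markov game.
# sample_episode, fit_model, solve_minimax, and play are assumed callables.
def explore_then_exploit(T, sample_episode, fit_model, solve_minimax, play):
    T0 = int(T ** (2 / 3))                   # exploration budget; qualitatively
                                             # matches the T^{2/3} regret rate
    data = [sample_episode() for _ in range(T0)]   # phase 1: explore
    policy = solve_minimax(fit_model(data))        # plan on the fitted model
    for _ in range(T - T0):                        # phase 2: exploit
        play(policy)
```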
This list is automatically generated from the titles and abstracts of the papers on this site.