Provable Self-Play Algorithms for Competitive Reinforcement Learning
- URL: http://arxiv.org/abs/2002.04017v3
- Date: Thu, 9 Jul 2020 17:07:54 GMT
- Title: Provable Self-Play Algorithms for Competitive Reinforcement Learning
- Authors: Yu Bai, Chi Jin
- Abstract summary: We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $tildemathcalO(sqrtT)$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret $tildemathcalO(T2/3)$, but is guaranteed to run in time even in the worst case.
- Score: 48.12602400021397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-play, where the algorithm learns by playing against itself without
requiring any direct supervision, has become the new weapon in modern
Reinforcement Learning (RL) for achieving superhuman performance in practice.
However, the majority of exisiting theory in reinforcement learning only
applies to the setting where the agent plays against a fixed environment; it
remains largely open whether self-play algorithms can be provably effective,
especially when it is necessary to manage the exploration/exploitation
tradeoff. We study self-play in competitive reinforcement learning under the
setting of Markov games, a generalization of Markov decision processes to the
two-player case. We introduce a self-play algorithm---Value Iteration with
Upper/Lower Confidence Bound (VI-ULCB)---and show that it achieves regret
$\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, where the
regret is measured by the agent's performance against a \emph{fully
adversarial} opponent who can exploit the agent's strategy at \emph{any} step.
We also introduce an explore-then-exploit style algorithm, which achieves a
slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to
run in polynomial time even in the worst case. To the best of our knowledge,
our work presents the first line of provably sample-efficient self-play
algorithms for competitive reinforcement learning.
Related papers
- In-Context Exploiter for Extensive-Form Games [38.24471816329584]
We introduce a novel method, In-Context Exploiter (ICE), to train a single model that can act as any player in the game and adaptively exploit opponents entirely by in-context learning.
Our ICE algorithm involves generating diverse opponent strategies, collecting interactive history data by a reinforcement learning algorithm, and training a transformer-based agent within a well-designed curriculum learning framework.
arXiv Detail & Related papers (2024-08-10T14:59:09Z) - Guarantees for Self-Play in Multiplayer Games via Polymatrix
Decomposability [2.2636685010313364]
Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself.
We show that in two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent.
For the first time, our results identify a structural property of multiplayer games that enable performance guarantees for the strategies produced by a broad class of self-play algorithms.
arXiv Detail & Related papers (2023-10-17T18:33:21Z) - Provably Efficient Generalized Lagrangian Policy Optimization for Safe
Multi-Agent Reinforcement Learning [105.7510838453122]
We examine online safe multi-agent reinforcement learning using constrained Markov games.
We develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem.
Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via projected gradient step.
arXiv Detail & Related papers (2023-05-31T22:09:24Z) - No-Regret Learning in Time-Varying Zero-Sum Games [99.86860277006318]
Learning from repeated play in a fixed zero-sum game is a classic problem in game theory and online learning.
We develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under three performance measures.
Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property.
arXiv Detail & Related papers (2022-01-30T06:10:04Z) - Efficient Competitive Self-Play Policy Optimization [20.023522000925094]
We propose a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games.
Our method trains several agents simultaneously, and intelligently takes each other as opponent based on simple adversarial rules.
We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games.
arXiv Detail & Related papers (2020-09-13T21:01:38Z) - Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include algorithm's regret guarantees that depend on the regularity of the opponent's response.
arXiv Detail & Related papers (2020-07-10T09:33:05Z) - Efficient exploration of zero-sum stochastic games [83.28949556413717]
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay.
During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well.
Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly.
arXiv Detail & Related papers (2020-02-24T20:30:38Z) - Learning Zero-Sum Simultaneous-Move Markov Games Using Function
Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.