Anytime Optimal PSRO for Two-Player Zero-Sum Games
- URL: http://arxiv.org/abs/2201.07700v1
- Date: Wed, 19 Jan 2022 16:34:11 GMT
- Title: Anytime Optimal PSRO for Two-Player Zero-Sum Games
- Authors: Stephen McAleer, Kevin Wang, Marc Lanctot, John Lanier, Pierre Baldi,
Roy Fox
- Abstract summary: Policy Space Response Oracles (PSRO) is a multi-agent reinforcement learning algorithm for games that can handle continuous actions.
Anytime Optimal Double Oracle (AODO) is a tabular double oracle algorithm for two-player zero-sum games that is guaranteed to converge to a Nash equilibrium.
We show that our methods achieve far lower exploitability than DO and PSRO and never increase exploitability.
- Score: 17.821479538423155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy Space Response Oracles (PSRO) is a multi-agent reinforcement learning
algorithm for games that can handle continuous actions and has empirically
found approximate Nash equilibria in large games. PSRO is based on the tabular
Double Oracle (DO) method, an algorithm that is guaranteed to converge to a
Nash equilibrium, but may increase exploitability from one iteration to the
next. We propose Anytime Optimal Double Oracle (AODO), a tabular double oracle
algorithm for 2-player zero-sum games that is guaranteed to converge to a Nash
equilibrium while decreasing exploitability from iteration to iteration. Unlike
DO, in which the meta-strategy is based on the restricted game formed by each
player's strategy sets, AODO finds the meta-strategy for each player that
minimizes its exploitability against any policy in the full, unrestricted game.
We also propose a method of finding this meta-strategy via a no-regret
algorithm updated against a continually-trained best response, called RM-BR DO.
Finally, we propose Anytime Optimal PSRO, a version of AODO that calculates
best responses via reinforcement learning. In experiments on Leduc poker and
random normal form games, we show that our methods achieve far lower
exploitability than DO and PSRO and never increase exploitability.
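As a concrete illustration of the meta-strategy idea, below is a minimal sketch of the RM-BR inner loop on a tabular zero-sum matrix game. It is a simplified reading of the abstract, not the paper's implementation: regret matching mixes over the row player's population while the opponent's best response is computed exactly in the full game (standing in for the continually trained best response), and all function and variable names are illustrative.

```python
import numpy as np

def regret_matching(regrets):
    """Map cumulative regrets to a mixed strategy (uniform if none are positive)."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(pos), 1.0 / len(pos))

def rm_br_meta_strategy(A, population, inner_iters=5000):
    """Meta-strategy over `population` (row indices of payoff matrix A; the row
    player maximizes) aimed at minimizing exploitability against ANY column of
    the full game: regret matching updates the row mixture against the column
    player's best response to it. The best response is exact here; in RM-BR DO
    and Anytime Optimal PSRO it is a continually trained approximate response."""
    k = len(population)
    regrets = np.zeros(k)
    avg = np.zeros(k)
    for _ in range(inner_iters):
        sigma = regret_matching(regrets)
        # Express the mixture over the population as a full-game row strategy.
        row_mix = np.zeros(A.shape[0])
        for weight, row in zip(sigma, population):
            row_mix[row] += weight
        br_col = int(np.argmin(row_mix @ A))   # column best response in the full game
        payoffs = A[population, br_col]        # each population member vs. that response
        regrets += payoffs - sigma @ payoffs   # standard regret-matching update
        avg += sigma
    return avg / inner_iters                   # the average strategy is the meta-strategy

# Toy usage: rock-paper-scissors with population {rock, paper}; the least-exploitable
# mixture over that population is roughly (1/3 rock, 2/3 paper), worth about -1/3
# against a best-responding opponent.
A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
sigma = rm_br_meta_strategy(A, population=[0, 1])
```

Per the abstract, the outer loop then proceeds as in DO, with each player adding a best response against the opponent's meta-strategy to its population; only the meta-strategy computation changes.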
Related papers
- ApproxED: Approximate exploitability descent via learned best responses [61.17702187957206]
We study the problem of finding an approximate Nash equilibrium of games with continuous action sets.
We propose two new methods that minimize an approximation of exploitability with respect to the strategy profile.
arXiv Detail & Related papers (2023-01-20T23:55:30Z)
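To make the exploitability-minimization idea in the ApproxED entry above concrete, here is a toy sketch of projected subgradient descent on the exploitability of a joint strategy profile in a zero-sum matrix game. The exact best responses and the simplex projection are generic simplifications (ApproxED instead uses learned best responses); all names are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def exploitability(A, x, y):
    """NashConv of profile (x, y) in the zero-sum game A (row maximizes)."""
    return np.max(A @ y) - np.min(x @ A)

def exploitability_descent(A, steps=2000, lr=0.05):
    """Projected subgradient descent on the exploitability of the joint profile.

    Each step moves a player toward its payoff against the opponent's current
    best response (computed exactly here; ApproxED learns it instead)."""
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    for _ in range(steps):
        br_row = np.argmax(A @ y)                    # row best response to y
        br_col = np.argmin(x @ A)                    # column best response to x
        x = project_simplex(x + lr * A[:, br_col])   # ascend row payoff vs. br_col
        y = project_simplex(y - lr * A[br_row, :])   # descend row payoff vs. br_row
    return x, y, exploitability(A, x, y)

# Toy usage: rock-paper-scissors; both strategies drift toward roughly uniform play.
A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
x, y, nash_conv = exploitability_descent(A)
```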
- Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z)
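The fictitious-play component of the entry above can be illustrated with the classic discrete update on a zero-sum matrix game; the sketch below is the textbook version, not the paper's optimistic policy-optimization variant for Markov games.

```python
import numpy as np

def fictitious_play(A, iters=10000):
    """Classic fictitious play on a zero-sum matrix game A (row maximizes).

    Each player best-responds to the opponent's empirical average strategy;
    in two-player zero-sum games the average strategies converge to a Nash
    equilibrium (Robinson, 1951)."""
    m, n = A.shape
    row_counts = np.ones(m) / m               # start from uniform empirical averages
    col_counts = np.ones(n) / n
    for _ in range(iters):
        x_avg = row_counts / row_counts.sum()
        y_avg = col_counts / col_counts.sum()
        row_counts[np.argmax(A @ y_avg)] += 1.0   # row best response to y_avg
        col_counts[np.argmin(x_avg @ A)] += 1.0   # column best response to x_avg
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Toy usage: matching pennies; both averages approach (0.5, 0.5).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_avg, y_avg = fictitious_play(A)
```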
- Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games [69.5064797859053]
We introduce Self-Play PSRO (SP-PSRO), a method that adds an approximately optimal policy to the population at each iteration.
SP-PSRO empirically tends to converge much faster than APSRO and, in many games, converges in just a few iterations.
arXiv Detail & Related papers (2022-07-13T22:55:51Z)
- Efficient Policy Space Response Oracles [61.71849698253696]
The Policy Space Response Oracle (PSRO) method provides a general solution for finding Nash equilibria in two-player zero-sum games.
Central to our development is the newly introduced minimax optimization over unrestricted-restricted (URR) games.
We report a 50x speedup in wall-clock time, 10x higher data efficiency, and exploitability similar to existing PSRO methods on Kuhn and Leduc poker games.
arXiv Detail & Related papers (2022-01-28T17:54:45Z)
- Online Double Oracle [20.291016382324777]
This paper proposes new learning algorithms for two-player zero-sum games in which the number of pure strategies is huge or infinite.
Our method achieves a regret bound of $\mathcal{O}(\sqrt{T k \log(k)})$ in the self-play setting, where $k$ is not the size of the game.
arXiv Detail & Related papers (2021-03-13T19:48:27Z)
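Online Double Oracle (above) builds on the classic tabular Double Oracle scheme, which is also the starting point of the main paper. A minimal DO sketch on a matrix game follows; solving the restricted game with scipy's LP routine is a generic choice for the sketch, not something prescribed by either paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """Row player's maximin mixed strategy and value for matrix game M, via an LP."""
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximize the game value v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v <= (x^T M)_j for every column j
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                              # x is a probability vector
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

def double_oracle(A, max_iters=100):
    """Tabular Double Oracle: solve the restricted game, add each player's
    full-game best response, and stop once nothing new gets added."""
    rows, cols = [0], [0]
    for _ in range(max_iters):
        M = A[np.ix_(rows, cols)]
        x_r, _ = solve_zero_sum(M)                 # row equilibrium of the restricted game
        y_r, _ = solve_zero_sum(-M.T)              # column equilibrium (from their view)
        x = np.zeros(A.shape[0])
        y = np.zeros(A.shape[1])
        x[rows] = x_r
        y[cols] = y_r
        br_row = int(np.argmax(A @ y))             # best responses in the FULL game
        br_col = int(np.argmin(x @ A))
        if br_row in rows and br_col in cols:
            break                                  # no expansion: restricted NE is a full NE
        rows = sorted(set(rows) | {br_row})
        cols = sorted(set(cols) | {br_col})
    return x, y
```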
- XDO: A Double Oracle Algorithm for Extensive-Form Games [14.823154995416997]
We propose an extensive-form double oracle algorithm that converges to an approximate Nash equilibrium in a number of iterations linear in the number of infostates.
Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate.
In experiments on a modified Leduc poker game, we show that XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP.
arXiv Detail & Related papers (2021-03-11T03:05:44Z)
- Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z)
- Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games [11.866835246140647]
Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm guaranteed to converge to an approximate Nash equilibrium.
We introduce Pipeline PSRO, the first scalable general method for finding approximate Nash equilibria in large games.
We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$.
arXiv Detail & Related papers (2020-06-15T17:17:17Z)
- Efficient exploration of zero-sum stochastic games [83.28949556413717]
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay.
During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well.
Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly.
arXiv Detail & Related papers (2020-02-24T20:30:38Z)