Pipeline PSRO: A Scalable Approach for Finding Approximate Nash
Equilibria in Large Games
- URL: http://arxiv.org/abs/2006.08555v2
- Date: Thu, 18 Feb 2021 19:26:01 GMT
- Title: Pipeline PSRO: A Scalable Approach for Finding Approximate Nash
Equilibria in Large Games
- Authors: Stephen McAleer, John Lanier, Roy Fox, Pierre Baldi
- Abstract summary: Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm guaranteed to converge to an approximate Nash equilibrium.
We introduce Pipeline PSRO, the first scalable general method for finding approximate Nash equilibria in large games.
We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$.
- Score: 11.866835246140647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding approximate Nash equilibria in zero-sum imperfect-information games
is challenging when the number of information states is large. Policy Space
Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in
game theory that is guaranteed to converge to an approximate Nash equilibrium.
However, PSRO requires training a reinforcement learning policy at each
iteration, making it too slow for large games. We show through counterexamples
and experiments that DCH and Rectified PSRO, two existing approaches to scaling
up PSRO, fail to converge even in small games. We introduce Pipeline PSRO
(P2SRO), the first scalable general method for finding approximate Nash
equilibria in large zero-sum imperfect-information games. P2SRO is able to
parallelize PSRO with convergence guarantees by maintaining a hierarchical
pipeline of reinforcement learning workers, each training against the policies
generated by lower levels in the hierarchy. We show that unlike existing
methods, P2SRO converges to an approximate Nash equilibrium, and does so faster
as the number of parallel workers increases, across a variety of imperfect
information games. We also introduce an open-source environment for Barrage
Stratego, a variant of Stratego with an approximate game tree complexity of
$10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage
Stratego and beats all existing bots. Experiment code is available
at https://github.com/JBLanier/pipeline-psro.
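The hierarchical pipeline described in the abstract can be summarized in a short sketch. The code below is only an illustration of the pipeline idea, written serially for readability; `train_best_response` and `solve_meta_nash` are hypothetical helpers standing in for RL best-response training and restricted-game (meta) Nash solving. The authors' actual implementation is in the linked repository.

```python
# A simplified, serial sketch of the P2SRO pipeline idea; not the authors' code.
from typing import Callable, List


def p2sro_sketch(num_levels: int,
                 num_iterations: int,
                 train_best_response: Callable,   # hypothetical: RL best-response training
                 solve_meta_nash: Callable,       # hypothetical: Nash of the restricted game
                 initial_policy):
    fixed_population: List = [initial_policy]   # completed ("lower level") policies
    active = [None] * num_levels                # in-progress policies, one per pipeline level

    for _ in range(num_iterations):
        # Restricted-game meta-strategy over the completed policies.
        meta_strategy = solve_meta_nash(fixed_population)

        # Each active level trains only against policies from lower levels,
        # so in a real implementation all levels can run in parallel.
        for level in range(num_levels):
            lower = fixed_population + [p for p in active[:level] if p is not None]
            active[level] = train_best_response(opponents=lower,
                                                opponent_mixture=meta_strategy)

        # Promote the lowest active policy: freeze it, add it to the population,
        # and start a fresh policy at the top of the pipeline.
        fixed_population.append(active.pop(0))
        active.append(None)

    return fixed_population, solve_meta_nash(fixed_population)
```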
Related papers
- On Tractable $Φ$-Equilibria in Non-Concave Games [53.212133025684224]
Non-concave games introduce significant game-theoretic and optimization challenges.
We show that when $\Phi$ is finite, there exists an efficient uncoupled learning algorithm that converges to the corresponding $\Phi$-equilibria.
We also show that Online Gradient Descent can efficiently approximate $\Phi$-equilibria in non-trivial regimes.
arXiv Detail & Related papers (2024-03-13T01:51:30Z) - Neural Population Learning beyond Symmetric Zero-sum Games [52.20454809055356]
We introduce NeuPL-JPSRO, a neural population learning algorithm that benefits from transfer learning of skills and converges to a Coarse Correlated Equilibrium (CCE) of the game.
Our work shows that equilibrium convergent population learning can be implemented at scale and in generality.
arXiv Detail & Related papers (2024-01-10T12:56:24Z) - Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed
Cooperative-Competitive Games [14.979239870856535]
Self-play (SP) is a popular reinforcement learning framework for solving competitive games.
In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits from both frameworks.
arXiv Detail & Related papers (2023-10-05T07:19:33Z) - Global Nash Equilibrium in Non-convex Multi-player Game: Theory and
Algorithms [66.8634598612777]
We show that a Nash equilibrium (NE) is acceptable to all players in a multi-player game, since no player can benefit from deviating unilaterally.
We build up the general theory and algorithms step by step.
arXiv Detail & Related papers (2023-01-19T11:36:50Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games [69.5064797859053]
We introduce Self-Play PSRO (SP-PSRO), a method that adds an approximately optimal policy to the population in each iteration.
SP-PSRO empirically tends to converge much faster than APSRO, and in many games it converges in just a few iterations.
arXiv Detail & Related papers (2022-07-13T22:55:51Z) - Efficient Policy Space Response Oracles [61.71849698253696]
The Policy Space Response Oracle method (PSRO) provides a general solution for finding Nash equilibria in two-player zero-sum games.
Central to our development is the newly introduced minimax optimization over unrestricted-restricted (URR) games.
We report a 50x speedup in wall-time, 10x data efficiency, and similar exploitability as existing PSRO methods on Kuhn and Leduc Poker games.
arXiv Detail & Related papers (2022-01-28T17:54:45Z) - Anytime Optimal PSRO for Two-Player Zero-Sum Games [17.821479538423155]
Policy Space Response Oracles (PSRO) is a reinforcement learning algorithm for games that can handle continuous actions.
AODO is a double oracle algorithm for 2-player zero-sum games that converges to a Nash equilibrium.
We show that our methods achieve far lower exploitability than DO and PSRO and never increase exploitability.
arXiv Detail & Related papers (2022-01-19T16:34:11Z) - XDO: A Double Oracle Algorithm for Extensive-Form Games [14.823154995416997]
We propose an extensive-form double oracle algorithm that converges to an approximate Nash equilibrium linearly in the number of infostates.
Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate.
In experiments on a modified Leduc poker game, we show that XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP.
arXiv Detail & Related papers (2021-03-11T03:05:44Z)
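Several of the papers above (PSRO, AODO, XDO) build on the double oracle loop: repeatedly solve the restricted game over the current strategy populations, then expand each population with a best response to the opponent's meta-Nash mixture. The sketch below illustrates only this generic loop on a small zero-sum matrix game, where the restricted game can be solved exactly with a linear program; it is not an implementation of any specific algorithm listed here.

```python
# Minimal double-oracle sketch for a fully known zero-sum matrix game.
import numpy as np
from scipy.optimize import linprog


def solve_restricted_game(payoffs: np.ndarray):
    """Row player's Nash mixture for a zero-sum matrix game via an LP.
    payoffs[i, j] = row player's payoff for row strategy i vs. column strategy j."""
    m, n = payoffs.shape
    # Variables: m mixture weights x plus the game value v; objective: maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: v - (A^T x)_j <= 0 for every column j.
    A_ub = np.hstack([-payoffs.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Mixture weights sum to 1.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], -res.fun  # (mixture, game value)


def double_oracle(full_payoffs: np.ndarray, iters: int = 50):
    """Grow row/column populations by best-responding to the opponent's meta-Nash."""
    rows, cols = [0], [0]                       # start with one pure strategy each
    for _ in range(iters):
        sub = full_payoffs[np.ix_(rows, cols)]
        row_mix, _ = solve_restricted_game(sub)
        col_mix, _ = solve_restricted_game(-sub.T)   # column player's side of the game
        # Best responses in the FULL game against the opponents' restricted mixtures.
        br_row = int(np.argmax(full_payoffs[:, cols] @ col_mix))
        br_col = int(np.argmin(row_mix @ full_payoffs[rows, :]))
        if br_row in rows and br_col in cols:   # no new strategies: converged
            break
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    return rows, cols, row_mix, col_mix


if __name__ == "__main__":
    # Example: rock-paper-scissors; the loop recovers the uniform mixture.
    rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
    print(double_oracle(rps))
```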