XDO: A Double Oracle Algorithm for Extensive-Form Games
- URL: http://arxiv.org/abs/2103.06426v1
- Date: Thu, 11 Mar 2021 03:05:44 GMT
- Title: XDO: A Double Oracle Algorithm for Extensive-Form Games
- Authors: Stephen McAleer, John Lanier, Pierre Baldi, Roy Fox
- Abstract summary: We propose an extensive-form double oracle algorithm that converges to an approximate Nash equilibrium linearly in the number of infostates.
Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate.
In experiments on a modified Leduc poker game, we show that XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP.
- Score: 14.823154995416997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy Space Response Oracles (PSRO) is a deep reinforcement learning
algorithm for two-player zero-sum games that has empirically found approximate
Nash equilibria in large games. Although PSRO is guaranteed to converge to a
Nash equilibrium, it may take an exponential number of iterations as the number
of infostates grows. We propose Extensive-Form Double Oracle (XDO), an
extensive-form double oracle algorithm that is guaranteed to converge to an
approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO,
which mixes best responses at the root of the game, XDO mixes best responses at
every infostate. We also introduce Neural XDO (NXDO), where the best response
is learned through deep RL. In tabular experiments on Leduc poker, we find that
XDO achieves an approximate Nash equilibrium in a number of iterations 1-2
orders of magnitude smaller than PSRO. In experiments on a modified Leduc poker
game, we show that tabular XDO achieves over 11x lower exploitability than CFR
and over 82x lower exploitability than PSRO and XFP in the same amount of time.
We also show that NXDO beats PSRO and is competitive with NFSP on a large
no-limit poker game.
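To illustrate the core difference the abstract describes, here is a minimal sketch (not the authors' implementation) contrasting how PSRO and XDO build their restricted strategy spaces from a population of best responses; the infostate names, actions, and helper functions below are illustrative assumptions, not part of the paper.

```python
# Minimal sketch, not the paper's code: PSRO mixes over whole population
# policies at the root, while XDO's restricted game keeps every population
# action at each infostate. Infostates, actions, and policies are toy examples.
from typing import Dict, List, Set

Policy = Dict[str, str]  # maps infostate -> action

population: List[Policy] = [
    {"I1": "fold", "I2": "call"},
    {"I1": "raise", "I2": "call"},
    {"I1": "fold", "I2": "raise"},
]

def psro_restricted_strategies(pop: List[Policy]) -> List[Policy]:
    # PSRO's meta-game: a restricted strategy commits to one whole policy,
    # so mixing happens only at the root over these complete policies.
    return list(pop)

def xdo_restricted_actions(pop: List[Policy]) -> Dict[str, Set[str]]:
    # XDO's restricted game: at each infostate, allow every action played by
    # any population member, so the meta-solver can mix per infostate.
    actions: Dict[str, Set[str]] = {}
    for policy in pop:
        for infostate, action in policy.items():
            actions.setdefault(infostate, set()).add(action)
    return actions

if __name__ == "__main__":
    print(psro_restricted_strategies(population))  # 3 whole-policy choices
    print(xdo_restricted_actions(population))
    # {'I1': {'fold', 'raise'}, 'I2': {'call', 'raise'}} -> 4 pure combinations
    # from the same 3-policy population, i.e. per-infostate mixing can realize
    # strategies that a root-level mixture over whole policies cannot.
```

In the full double oracle loop described in the abstract, the restricted game would then be solved (tabularly in XDO, or with deep RL in NXDO), a best response to the resulting meta-strategy computed in the full game, and that best response added to the population until exploitability is sufficiently low.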
Related papers
- Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms [71.73971094342349]
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games.
We show that Optimistic Multiplicative Weights Update (OMWU) offers several advantages, including logarithmic dependence on the size of the payoff matrix.
We prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue.
arXiv Detail & Related papers (2024-06-15T13:26:17Z)
- Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms [66.8634598612777]
We show that Nash equilibrium (NE) is acceptable to all players in a multi-player game, in that no player can benefit by deviating unilaterally, and we develop the general theory and algorithms step by step.
arXiv Detail & Related papers (2023-01-19T11:36:50Z)
- Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games [69.5064797859053]
We introduce Self-Play PSRO (SP-PSRO), a method that adds an approximately optimal policy to the population in each iteration.
SP-PSRO empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
arXiv Detail & Related papers (2022-07-13T22:55:51Z)
- Near-Optimal Learning of Extensive-Form Games with Imperfect Information [54.55092907312749]
We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games.
This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$.
arXiv Detail & Related papers (2022-02-03T18:18:28Z)
- Efficient Policy Space Response Oracles [61.71849698253696]
The Policy Space Response Oracles (PSRO) method provides a general solution for finding Nash equilibria in two-player zero-sum games.
Central to our development is the newly introduced minimax optimization over unrestricted-restricted (URR) games.
We report a 50x speedup in wall-time, 10x data efficiency, and similar exploitability as existing PSRO methods on Kuhn and Leduc Poker games.
arXiv Detail & Related papers (2022-01-28T17:54:45Z)
- Anytime Optimal PSRO for Two-Player Zero-Sum Games [17.821479538423155]
Policy Space Response Oracles (PSRO) is a reinforcement learning algorithm for games that can handle continuous actions.
AODO is a double oracle algorithm for 2-player zero-sum games that converges to a Nash equilibrium.
We show that our methods achieve far lower exploitability than DO and PSRO and never increase exploitability.
arXiv Detail & Related papers (2022-01-19T16:34:11Z)
- Online Double Oracle [20.291016382324777]
This paper proposes new learning algorithms in two-player zero-sum games where the number of pure strategies is huge or infinite.
Our method achieves a regret bound of $\mathcal{O}(\sqrt{T k \log(k)})$ in the self-play setting, where $k$ is not the size of the game.
arXiv Detail & Related papers (2021-03-13T19:48:27Z)
- Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation [92.99933928528797]
We study reinforcement learning for two-player zero-sum Markov games with simultaneous moves.
We propose an algorithm, Nash-UCRL-VTR, based on the principle of "Optimism-in-Face-of-Uncertainty".
We show that Nash-UCRL-VTR can provably achieve an $\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension.
arXiv Detail & Related papers (2021-02-15T09:09:16Z)
- Combining Deep Reinforcement Learning and Search for Imperfect-Information Games [30.520629802135574]
We present ReBeL, a framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in two-player zero-sum games.
We also show that ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI.
arXiv Detail & Related papers (2020-07-27T15:21:22Z)
- Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games [11.866835246140647]
Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm guaranteed to converge to an approximate Nash equilibrium.
We introduce Pipeline PSRO, the first scalable general method for finding approximate Nash equilibria in large games.
We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$.
arXiv Detail & Related papers (2020-06-15T17:17:17Z)