Related papers: Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

URL: http://arxiv.org/abs/2207.12463v1
Date: Mon, 25 Jul 2022 18:29:16 GMT
Title: Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions
Authors: Shuang Qiu, Xiaohan Wei, Jieping Ye, Zhaoran Wang, Zhuoran Yang
Abstract summary: We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We prove tight $widetildemathcalO(sqrtK)$ regret bounds after $K$ episodes in a two-agent competitive game scenario. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
Score: 145.54544979467872
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.

Related papers

Two-Player Zero-Sum Games with Bandit Feedback [6.66618805642802]
We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player.<n>We propose three algorithms based on the Explore-Then-Commit framework.
arXiv Detail & Related papers (2025-06-17T13:46:32Z)
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Multi-step Preference Optimization (MPO) is built upon the natural actor-critic frameworkciteprakhlin2013online,joulani17a. We show that OMPO requires $mathcalO(epsilon-1)$ policy updates to converge to an $epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method on multi-turn conversations dataset and math reasoning dataset.
arXiv Detail & Related papers (2025-02-18T09:33:48Z)
Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence Beyond the Minty Property [89.96815099996132]
We develop a new framework to characterize optimistic policy gradient methods in multi-player games with a single controller. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate to have further applications beyond Markov games.
arXiv Detail & Related papers (2023-12-19T11:34:10Z)
Scalable and Independent Learning of Nash Equilibrium Policies in $n$-Player Stochastic Games with Unknown Independent Chains [1.0878040851638]
We study games with independent chains and unknown transition matrices. In this class of games, players control their own internal Markov chains whose transitions do not depend on the states/actions of other players. We propose a fully decentralized mirror descent algorithm to learn an $epsilon$-NE policy.
arXiv Detail & Related papers (2023-12-04T03:04:09Z)
Near-Optimal No-Regret Learning for General Convex Games [121.50979258049135]
We show that regret can be obtained for general convex and compact strategy sets. Our dynamics are on an instantiation of optimistic follow-the-regularized-bounds over an appropriately emphlifted space. Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret.
arXiv Detail & Related papers (2022-06-17T12:58:58Z)
Policy Optimization for Markov Games: Unified Framework and Faster Convergence [81.3266426402464]
We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game. We extend this algorithm to multi-player general Markov Games and show a $mathcalwidetildeO(T-1/2)$ convergence rate to Correlated Equilibria (CCE)
arXiv Detail & Related papers (2022-06-06T14:23:13Z)
Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. We propose a new algorithm, underlineDecentralized underlineOptimistic hypeunderlineRpolicy munderlineIrror deunderlineScent (DORIS) DORIS achieves $sqrtK$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z)
Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence [30.084357461497042]
We learn a Nash equilibrium of an MPG in which the size of state space and/or the number of players can be very large. We propose new independent policy gradient algorithms that are run by all players in tandem. We identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with the players that are oblivious to the types of games being played.
arXiv Detail & Related papers (2022-02-08T20:09:47Z)
Towards convergence to Nash equilibria in two-team zero-sum games [17.4461045395989]
Two-team zero-sum games are defined as multi-player games where players are split into two competing sets of agents. We focus on the solution concept of Nash equilibria (NE) We show that computing NE for this class of games is $textithard$ for the complexity class $mathrm$.
arXiv Detail & Related papers (2021-11-07T21:15:35Z)
Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games. In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.