Joint Policy Search for Multi-agent Collaboration with Imperfect
Information
- URL: http://arxiv.org/abs/2008.06495v5
- Date: Sun, 6 Dec 2020 01:10:09 GMT
- Title: Joint Policy Search for Multi-agent Collaboration with Imperfect
Information
- Authors: Yuandong Tian, Qucheng Gong, Tina Jiang
- Abstract summary: We show that global changes of game values can be decomposed into policy changes localized at each information set.
We propose Joint Policy Search that iteratively improves joint policies of collaborative agents in imperfect information games.
- Score: 31.559835225116473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning good joint policies for multi-agent collaboration with imperfect
information remains a fundamental challenge. While coordinate-ascent approaches
(optimizing one agent's policy at a time, e.g., self-play) work with guarantees for
two-player zero-sum games, in multi-agent cooperative settings they often converge to
sub-optimal Nash equilibria. On the other hand, directly modeling joint policy changes
in imperfect-information games is nontrivial due to the complicated interplay of
policies (e.g., upstream updates affect downstream state reachability). In this paper,
we show that global changes of game values can be decomposed into policy changes
localized at each information set, via a novel term we call the policy-change density.
Based on this, we propose Joint Policy Search (JPS), which iteratively improves the
joint policies of collaborative agents in imperfect-information games without
re-evaluating the entire game. On multi-agent collaborative tabular games, JPS is
proven never to worsen performance and can improve solutions provided by unilateral
approaches (e.g., CFR), outperforming algorithms designed for collaborative policy
learning (e.g., BAD). Furthermore, for real-world games, JPS has an online form that
naturally links with gradient updates. We test it on Contract Bridge, a 4-player
imperfect-information game where a team of two collaborates to compete against the
other team. In its bidding phase, players bid in turn to find a good contract through
a limited information channel. Based on a strong baseline agent that bids competitive
bridge purely through domain-agnostic self-play, JPS improves the collaboration of
team players and outperforms WBridge5, a championship-winning software, by $+0.63$
IMPs (International Matching Points) per board over 1k games, substantially better
than the previous SoTA ($+0.41$ IMPs per board) under Double-Dummy evaluation.
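
To make the iterative improvement concrete, the sketch below illustrates the kind of local search loop the abstract describes, written in Python against a hypothetical tabular-game interface (`game.infosets`, `game.expected_value`, and `game.local_variations` are assumptions for illustration, not the paper's API). For simplicity, each candidate change is scored by re-evaluating the whole game, which is exactly the cost the paper's policy-change density is designed to avoid.

```python
from copy import deepcopy

def joint_policy_search(game, joint_policy, num_iters=100):
    """Greedy local improvement of a joint policy on a collaborative tabular game.

    Assumed (hypothetical) interface:
      game.infosets                      -- iterable of information sets
      game.expected_value(policy)        -- team's expected game value under `policy`
      game.local_variations(policy, I)   -- candidate action distributions that
                                            differ from `policy` only at infoset I
    `joint_policy` is a dict mapping each information set to an action distribution.
    """
    best_value = game.expected_value(joint_policy)
    for _ in range(num_iters):
        improved = False
        for infoset in game.infosets:
            for candidate in game.local_variations(joint_policy, infoset):
                new_policy = deepcopy(joint_policy)
                new_policy[infoset] = candidate
                new_value = game.expected_value(new_policy)
                if new_value > best_value:      # keep only improving local changes
                    joint_policy, best_value = new_policy, new_value
                    improved = True
        if not improved:                        # no local change helps: stop
            break
    return joint_policy, best_value
```

Because only changes that improve the team value are accepted, this loop shares the "never worsen performance" property stated in the abstract for tabular games, but at the price of a full game re-evaluation per candidate, which JPS sidesteps by scoring localized changes with the policy-change density.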
Related papers
- N-Agent Ad Hoc Teamwork [36.10108537776956]
Current approaches to learning cooperative multi-agent behaviors assume relatively restrictive settings.
This paper formalizes the problem, and proposes the Policy Optimization with Agent Modelling (POAM) algorithm.
POAM is a policy gradient, multi-agent reinforcement learning approach to the NAHT problem, that enables adaptation to diverse teammate behaviors.
arXiv Detail & Related papers (2024-04-16T17:13:08Z) - Leading the Pack: N-player Opponent Shaping [52.682734939786464]
We extend Opponent Shaping (OS) methods to environments involving multiple co-players and multiple shaping agents.
We find that when playing with a large number of co-players, the relative performance of OS methods declines, suggesting that in the limit OS methods may not perform well.
arXiv Detail & Related papers (2023-12-19T20:01:42Z) - Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed
Cooperative-Competitive Games [14.979239870856535]
Self-play (SP) is a popular reinforcement learning framework for solving competitive games.
In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits from both frameworks.
arXiv Detail & Related papers (2023-10-05T07:19:33Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Policy Optimization for Markov Games: Unified Framework and Faster
Convergence [81.3266426402464]
We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game.
We extend this algorithm to multi-player general-sum Markov Games and show a $\widetilde{\mathcal{O}}(T^{-1/2})$ convergence rate to Coarse Correlated Equilibria (CCE).
arXiv Detail & Related papers (2022-06-06T14:23:13Z) - Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic Hyperpolicy Mirror Descent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z) - Multi-Agent Coordination in Adversarial Environments through Signal
Mediated Strategies [37.00818384785628]
Team members can coordinate their strategies before the beginning of the game, but are unable to communicate during the playing phase of the game.
In this setting, model-free RL methods are oftentimes unable to capture coordination because agents' policies are executed in a decentralized fashion.
We show convergence to coordinated equilibria in cases where previous state-of-the-art multi-agent RL algorithms did not.
arXiv Detail & Related papers (2021-02-09T18:44:16Z) - Multi-Agent Collaboration via Reward Attribution Decomposition [75.36911959491228]
We propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge.
CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques.
arXiv Detail & Related papers (2020-10-16T17:42:11Z) - Learning to Play No-Press Diplomacy with Best Response Policy Iteration [31.367850729299665]
We apply deep reinforcement learning methods to Diplomacy, a 7-player board game.
We show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
arXiv Detail & Related papers (2020-06-08T14:33:31Z)