Learning Diverse Risk Preferences in Population-based Self-play
- URL: http://arxiv.org/abs/2305.11476v2
- Date: Fri, 15 Dec 2023 08:06:38 GMT
- Title: Learning Diverse Risk Preferences in Population-based Self-play
- Authors: Yuhua Jiang, Qihan Liu, Xiaoteng Ma, Chenghao Li, Yiqin Yang, Jun
Yang, Bin Liang, Qianchuan Zhao
- Abstract summary: Current self-play algorithms optimize the agent to maximize expected win-rates against its current or historical copies.
We introduce diversity from the perspective that agents could have diverse risk preferences in the face of uncertainty.
We show that our method achieves comparable or superior performance in competitive games.
- Score: 23.07952140353786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Among the great successes of Reinforcement Learning (RL), self-play
algorithms play an essential role in solving competitive games. Current
self-play algorithms optimize the agent to maximize expected win-rates against
its current or historical copies, which often leaves the agent stuck in a local
optimum with a simple, homogeneous strategy style. A possible remedy is to
improve the diversity of policies, which helps the agent break the stalemate
and enhances its robustness when facing different opponents. However, enhancing
diversity in self-play algorithms is not trivial. In this paper, we aim to
introduce diversity from the perspective that agents could have diverse risk
preferences in the face of uncertainty. Specifically, we design a novel
reinforcement learning algorithm called Risk-sensitive Proximal Policy
Optimization (RPPO), which smoothly interpolates between worst-case and
best-case policy learning and allows for policy learning with desired risk
preferences. By seamlessly integrating RPPO into population-based self-play,
agents in the population optimize dynamic risk-sensitive objectives with
experiences from playing against diverse opponents. Empirical results show that
our method achieves comparable or superior performance in competitive games and
that diverse modes of behavior emerge. Our code is publicly available at
https://github.com/Jackory/RPBT.
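The abstract does not spell out RPPO's exact objective, but the core idea of interpolating between worst-case and best-case learning inside a population-based self-play loop can be illustrated with a small sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: the rank-based `risk_weighted_advantage` distortion, the `risk_param` range, and the uniform opponent sampling are hypothetical choices made only for this example.

```python
import numpy as np


def risk_weighted_advantage(advantages, risk_param):
    """Reweight advantages to emphasize bad or good outcomes.

    risk_param in [-1, 1]: -1 ~ worst-case (risk-averse), 0 ~ risk-neutral,
    +1 ~ best-case (risk-seeking). This rank-based linear distortion is an
    assumption made for illustration, not the paper's formulation.
    """
    ranks = np.argsort(np.argsort(advantages)) / max(len(advantages) - 1, 1)
    weights = 1.0 + risk_param * (2.0 * ranks - 1.0)
    weights = np.clip(weights, 0.0, None)
    return advantages * weights / (weights.mean() + 1e-8)


def ppo_clip_loss(ratio, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss)."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()


# Schematic population-based self-play loop: each agent carries its own risk
# preference and trains on experience gathered against sampled opponents.
rng = np.random.default_rng(0)
population = [{"risk_param": rng.uniform(-1, 1)} for _ in range(4)]

for agent in population:
    opponent = population[rng.integers(len(population))]
    # Placeholder rollout statistics; in practice these would come from
    # playing `agent` against `opponent` and computing GAE advantages.
    advantages = rng.normal(size=256)
    ratio = np.exp(rng.normal(scale=0.05, size=256))  # pi_new / pi_old
    adv = risk_weighted_advantage(advantages, agent["risk_param"])
    loss = ppo_clip_loss(ratio, adv)
```

In practice the reweighted advantages would feed into a full PPO update over real rollouts (see the released code at https://github.com/Jackory/RPBT for the authors' actual implementation) rather than the synthetic statistics used here.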
Related papers
- Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games [95.10091348976779]
We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
arXiv Detail & Related papers (2022-06-03T14:18:05Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Risk-Sensitive Bayesian Games for Multi-Agent Reinforcement Learning
under Policy Uncertainty [6.471031681646443]
In games with incomplete information, the uncertainty is evoked by the lack of knowledge about a player's own and the other players' types.
We propose risk-sensitive versions of existing algorithms designed for risk-neutral learning in games.
Our experimental analysis shows that risk-sensitive DAPG performs better than competing algorithms for both social welfare and general-sum games.
arXiv Detail & Related papers (2022-03-18T16:40:30Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Unifying Behavioral and Response Diversity for Open-ended Learning in
Zero-sum Games [44.30509625560908]
In open-ended learning algorithms, there are no widely accepted definitions for diversity, making it hard to construct and evaluate the diverse policies.
We propose a unified measure of diversity in multi-agent open-ended learning based on both Behavioral Diversity (BD) and Response Diversity (RD).
We show that many current diversity measures fall in one of the categories of BD or RD but not both.
With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning.
arXiv Detail & Related papers (2021-06-09T10:11:06Z) - Discovering Diverse Multi-Agent Strategic Behavior via Reward
Randomization [42.33734089361143]
We propose a technique for discovering diverse strategic policies in complex multi-agent games.
We derive a new algorithm, Reward-Randomized Policy Gradient (RPG).
RPG is able to discover multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas.
arXiv Detail & Related papers (2021-03-08T06:26:55Z) - Robust Reinforcement Learning using Adversarial Populations [118.73193330231163]
Reinforcement Learning (RL) is an effective tool for controller design but can struggle with issues of robustness.
We show that using a single adversary does not consistently yield robustness to dynamics variations under standard parametrizations of the adversary.
We propose a population-based augmentation to the Robust RL formulation in which we randomly initialize a population of adversaries and sample from the population uniformly during training.
arXiv Detail & Related papers (2020-08-04T20:57:32Z) - Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
arXiv Detail & Related papers (2020-07-24T01:52:11Z) - Improving Generalization of Reinforcement Learning with Minimax
Distributional Soft Actor-Critic [11.601356612579641]
This paper introduces the minimax formulation and distributional framework to improve the generalization ability of RL algorithms.
We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in distinct environments.
arXiv Detail & Related papers (2020-02-13T14:09:22Z)