Modeling Strong and Human-Like Gameplay with KL-Regularized Search
- URL: http://arxiv.org/abs/2112.07544v1
- Date: Tue, 14 Dec 2021 16:52:49 GMT
- Title: Modeling Strong and Human-Like Gameplay with KL-Regularized Search
- Authors: Athul Paul Jacob, David J. Wu, Gabriele Farina, Adam Lerer, Anton
Bakhtin, Jacob Andreas, Noam Brown
- Abstract summary: We consider the task of building strong but human-like policies in multi-agent decision-making problems.
Imitation learning is effective at predicting human actions but may not match the strength of expert humans.
We show in chess and Go that applying Monte Carlo tree search with policies regularized by the KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy.
- Score: 64.24339197581769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the task of building strong but human-like policies in
multi-agent decision-making problems, given examples of human behavior.
Imitation learning is effective at predicting human actions but may not match
the strength of expert humans, while self-play learning and search techniques
(e.g. AlphaZero) lead to strong performance but may produce policies that are
difficult for humans to understand and coordinate with. We show in chess and Go
that regularizing search policies based on the KL divergence from an
imitation-learned policy by applying Monte Carlo tree search produces policies
that have higher human prediction accuracy and are stronger than the imitation
policy. We then introduce a novel regret minimization algorithm that is
regularized based on the KL divergence from an imitation-learned policy, and
show that applying this algorithm to no-press Diplomacy yields a policy that
maintains the same human prediction accuracy as imitation learning while being
substantially stronger.
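
As a rough illustration of the objective described above (not the authors' full MCTS or regret-minimization implementation), maximizing expected value while paying a KL penalty toward an imitation-learned anchor policy has a simple closed form over the legal moves. In the sketch below, `q_values` stands in for the value estimates a search would produce and `imitation_probs` for the imitation-learned (human) policy; both names and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def kl_regularized_policy(q_values, imitation_probs, lam):
    """Policy maximizing E_pi[Q] - lam * KL(pi || imitation_probs).

    Closed form: pi(a) is proportional to imitation_probs[a] * exp(Q(a) / lam).
    Large lam keeps the policy close to the human-like imitation policy;
    small lam acts almost greedily on the search values.
    """
    logits = np.log(np.asarray(imitation_probs) + 1e-12) + np.asarray(q_values) / lam
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: three legal moves; search values favor move 2,
# while the imitation (human) policy favors move 0.
q = [0.1, 0.0, 0.4]
tau = np.array([0.7, 0.2, 0.1])
print(kl_regularized_policy(q, tau, lam=0.5))   # a blend of both preferences
print(kl_regularized_policy(q, tau, lam=5.0))   # stays close to tau
```

The single weight `lam` is the knob that trades playing strength against human prediction accuracy: large values keep the search policy near the imitation prior, small values defer almost entirely to the search values.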
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
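
As a minimal, generic sketch of the learn-stochastic/deploy-deterministic recipe (a plain REINFORCE update on an invented one-dimensional task, not the paper's algorithm or its hyperpolicy formulation), the exploration level `sigma` is used only during learning and dropped at deployment:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Toy one-dimensional task whose optimal action is a = 2.
    return -(a - 2.0) ** 2

mean, sigma, lr = 0.0, 1.0, 0.05    # sigma sets the exploration level during learning
for _ in range(500):
    a = mean + sigma * rng.standard_normal(32)           # stochastic (exploring) policy
    # REINFORCE estimate of d E[reward] / d mean for a Gaussian policy.
    grad = np.mean(reward(a) * (a - mean) / sigma**2)
    mean += lr * grad

# Deployment: drop the exploration noise and act with the deterministic version.
print("learned mean (should be close to 2):", round(mean, 2))
print("reward of the deployed deterministic action:", round(reward(mean), 3))
```

In this toy setting, raising `sigma` speeds up exploration but widens the gap between the stochastic policy used for learning and the deterministic policy that is finally deployed, which is the trade-off the paper analyzes.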
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
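
As a hedged sketch of the basic ingredient, the code below fits a reward vector to boundedly rational human choices under a static logit (Luce) choice model by maximum likelihood; the paper's setting is the harder offline, dynamic-choice case with a pessimistic estimator, so everything here (the option set, the choice model, the numbers) is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 4 discrete options with hidden "true" rewards.
true_r = np.array([0.0, 1.0, 2.0, 0.5])

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Simulate boundedly rational human choices: option a is picked with
# probability proportional to exp(true_r[a]) (a logit choice model).
choices = rng.choice(4, size=500, p=softmax(true_r))
counts = np.bincount(choices, minlength=4)

# Maximum-likelihood estimate of the reward vector under the same model.
r_hat = np.zeros(4)
lr = 0.1
for _ in range(2000):
    # Gradient of the average log-likelihood: observed frequencies minus model probs.
    r_hat += lr * (counts / len(choices) - softmax(r_hat))

print("true relative rewards:     ", true_r - true_r[0])
print("estimated relative rewards:", np.round(r_hat - r_hat[0], 2))
```

The reward is only identified up to an additive constant, hence the comparison of differences; recovering such a reward (and the optimal policy) from limited, off-policy human data with pessimism is the paper's contribution, which this sketch does not attempt.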
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Imitating Opponent to Win: Adversarial Policy Imitation Learning in Two-player Competitive Games [0.0]
Adversarial policies adopted by an adversary agent can cause a target RL agent to perform poorly in a multi-agent environment.
In existing studies, adversarial policies are directly trained based on experiences of interacting with the victim agent.
We design a new effective adversarial policy learning algorithm that overcomes this shortcoming.
arXiv Detail & Related papers (2022-10-30T18:32:02Z)
- Human-AI Coordination via Human-Regularized Search and Learning [33.95649252941375]
We develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark.
We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels.
We show that our method beats a vanilla best-response-to-behavioral-cloning baseline in an evaluation where experts play repeatedly with both agents.
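
A hedged sketch of the anchored regret-minimization idea behind human-regularized search (a generic KL-anchored Hedge/FTRL update in an invented matrix game, not this paper's or the main paper's exact algorithm): each player plays the policy that maximizes cumulative payoff minus a KL penalty toward a hypothetical human-like anchor `tau`.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

tau = np.array([0.6, 0.3, 0.1])   # hypothetical "human-like" anchor policy (assumed)
lam = 1.0                         # anchor strength: larger keeps play closer to tau
T = 5000

def anchored_policy(cum_value, anchor, lam, t):
    # FTRL step: argmax_pi <pi, cum_value> - lam * t * KL(pi || anchor)
    # has the closed form pi(a) proportional to anchor(a) * exp(cum_value(a) / (lam * t)).
    logits = np.log(anchor) + cum_value / (lam * t)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

S1, S2 = np.zeros(3), np.zeros(3)   # cumulative expected payoff of each pure action
avg1 = np.zeros(3)
for t in range(1, T + 1):
    p1 = anchored_policy(S1, tau, lam, t)
    p2 = anchored_policy(S2, tau, lam, t)
    S1 += A @ p2                    # row actions' payoffs against the column mix
    S2 += -(A.T @ p1)               # zero-sum: the column player's payoffs are negated
    avg1 += p1

# Large lam keeps average play near tau; as lam shrinks it moves toward the uniform Nash.
print("average anchored policy:", np.round(avg1 / T, 3))
```

The anchor plays the role of the behavioral-cloning human model: regret minimization improves on it while the KL term keeps the resulting strategy recognizable to the human partner.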
arXiv Detail & Related papers (2022-10-11T03:46:12Z)
- Synthesizing Policies That Account For Human Execution Errors Caused By State Aliasing In Markov Decision Processes [15.450115485745767]
An optimal MDP policy that is poorly executed (because of a human agent) may be much worse than another policy that is executed with fewer errors.
We present a framework to model the likelihood of policy execution errors and the likelihood of non-policy actions like inaction (delays) due to state uncertainty.
We then use hill climbing with a branch-and-bound algorithm to search for the optimal policy under this model.
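
Below is a small sketch of that core loop under stated assumptions: deterministic policies are evaluated on a random toy MDP with a crude execution-error model (a uniform random slip with probability `eps`, standing in for the paper's state-aliasing error likelihoods) and improved by plain hill climbing; the branch-and-bound component is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 5, 3, 0.9, 0.25   # eps: chance of a mis-executed action

# Random toy MDP (a stand-in for the paper's benchmark domains).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(nS, nA))

def value_with_errors(policy):
    """Start-state value of a deterministic policy when, with probability eps,
    the human executor takes a uniformly random action instead."""
    pi = np.full((nS, nA), eps / nA)
    pi[np.arange(nS), policy] += 1.0 - eps
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return v[0]

# Hill climbing over deterministic policies, scoring each candidate
# with the execution-error model rather than the error-free MDP.
policy = rng.integers(nA, size=nS)
best, improved = value_with_errors(policy), True
while improved:
    improved = False
    for s in range(nS):
        for a in range(nA):
            cand = policy.copy(); cand[s] = a
            v = value_with_errors(cand)
            if v > best + 1e-10:
                policy, best, improved = cand, v, True
print("policy found:", policy, "value under execution errors:", round(best, 3))
```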
arXiv Detail & Related papers (2021-09-15T17:10:46Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
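
A toy, learning-free sketch of the hierarchical decomposition only (both functions are hand-coded stand-ins for the learned high-level subgoal predictor and the goal-conditioned low-level policy; there is no training or critic here):

```python
import numpy as np

def high_level_subgoal(state, goal):
    # Stand-in for the learned high-level policy: imagine a midpoint subgoal.
    return 0.5 * (state + goal)

def low_level_action(state, subgoal, max_step=0.3):
    # Stand-in for the goal-conditioned low-level policy: step toward the subgoal.
    return np.clip(subgoal - state, -max_step, max_step)

# Toy 1-D navigation: the agent reaches a distant goal via imagined subgoals.
state, goal = 0.0, 5.0
for _ in range(40):
    subgoal = high_level_subgoal(state, goal)   # imagined intermediate target
    state += low_level_action(state, subgoal)   # act toward the subgoal
print("final distance to goal:", round(abs(goal - state), 3))
```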
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- Imitation Learning from MPC for Quadrupedal Multi-Gait Control [63.617157490920505]
We present a learning algorithm for training a single policy that imitates multiple gaits of a walking robot.
We use and extend MPC-Net, which is an Imitation Learning approach guided by Model Predictive Control.
We validate our approach on hardware and show that a single learned policy can replace its teacher to control multiple gaits.
arXiv Detail & Related papers (2021-03-26T08:48:53Z)
- School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget [4.726777092009554]
Pommerman is a hybrid cooperative/adversarial multi-agent environment.
This makes it a challenging environment for reinforcement learning approaches.
We develop a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games.
arXiv Detail & Related papers (2021-02-23T15:43:09Z)
- Policy Supervectors: General Characterization of Agents by their Behaviour [18.488655590845163]
We propose policy supervectors for characterizing agents by the distribution of states they visit.
Policy supervectors can characterize policies regardless of their design philosophy and scale to thousands of policies on a single workstation.
We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training, and imitation learning.
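
A simplified sketch of the idea under assumptions: the paper fits Gaussian mixture models to visited states, while here a plain histogram over an invented 1-D state space stands in, and `policy_bias` is a made-up proxy for policies with different behaviors.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_states(policy_bias, n_steps=2000):
    """Toy 1-D environment; policy_bias is an invented stand-in for a
    policy's behavioral tendency (drift direction)."""
    s, states = 0.0, []
    for _ in range(n_steps):
        s = np.clip(s + policy_bias + 0.5 * rng.standard_normal(), -10, 10)
        states.append(s)
    return np.array(states)

def supervector(states, bins=np.linspace(-10, 10, 21)):
    """Simplified 'policy supervector': a normalized histogram of visited states."""
    h, _ = np.histogram(states, bins=bins)
    return h / h.sum()

# Characterize three hypothetical policies by where they spend their time,
# then compare behaviors with a simple distance between supervectors.
vecs = {b: supervector(rollout_states(b)) for b in (-0.2, 0.0, 0.2)}
for b1 in vecs:
    for b2 in vecs:
        if b1 < b2:
            d = np.linalg.norm(vecs[b1] - vecs[b2])
            print(f"distance between policies {b1:+.1f} and {b2:+.1f}: {d:.3f}")
```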
arXiv Detail & Related papers (2020-12-02T14:43:16Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
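
A tabular toy sketch of the reward-conditioned idea under assumptions: treat the obtained reward as an extra input and learn which action goes with which outcome purely by supervised lookup over non-expert data (the paper trains neural reward-conditioned policies on full RL tasks; the bandit, numbers, and nearest-neighbor rule here are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit: three actions with different mean rewards.
means = np.array([0.2, 0.5, 0.8])

# 1) Collect experience with a random (non-expert) behavior policy.
actions = rng.integers(3, size=5000)
rewards = means[actions] + 0.1 * rng.standard_normal(5000)

# 2) "Reward-conditioned policy" via supervised lookup: for a target reward,
#    return the action that most often produced similar rewards in the data.
def reward_conditioned_action(target_reward, k=50):
    idx = np.argsort(np.abs(rewards - target_reward))[:k]   # k closest outcomes
    return np.bincount(actions[idx], minlength=3).argmax()

# 3) Conditioning on a high target reward yields good behavior with no expert data.
print("action when asked for reward 0.8:", reward_conditioned_action(0.8))  # likely 2
print("action when asked for reward 0.2:", reward_conditioned_action(0.2))  # likely 0
```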
arXiv Detail & Related papers (2019-12-31T18:07:43Z)