Modeling Strong and Human-Like Gameplay with KL-Regularized Search
- URL: http://arxiv.org/abs/2112.07544v1
- Date: Tue, 14 Dec 2021 16:52:49 GMT
- Title: Modeling Strong and Human-Like Gameplay with KL-Regularized Search
- Authors: Athul Paul Jacob, David J. Wu, Gabriele Farina, Adam Lerer, Anton
Bakhtin, Jacob Andreas, Noam Brown
- Abstract summary: We consider the task of building strong but human-like policies in multi-agent decision-making problems.
Imitation learning is effective at predicting human actions but may not match the strength of expert humans.
We show in chess and Go that applying Monte Carlo tree search with policies regularized by the KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy.
- Score: 64.24339197581769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the task of building strong but human-like policies in
multi-agent decision-making problems, given examples of human behavior.
Imitation learning is effective at predicting human actions but may not match
the strength of expert humans, while self-play learning and search techniques
(e.g. AlphaZero) lead to strong performance but may produce policies that are
difficult for humans to understand and coordinate with. We show in chess and Go
that regularizing search policies based on the KL divergence from an
imitation-learned policy by applying Monte Carlo tree search produces policies
that have higher human prediction accuracy and are stronger than the imitation
policy. We then introduce a novel regret minimization algorithm that is
regularized based on the KL divergence from an imitation-learned policy, and
show that applying this algorithm to no-press Diplomacy yields a policy that
maintains the same human prediction accuracy as imitation learning while being
substantially stronger.
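
As a rough illustration of the objective described above (not the authors' full MCTS or regret-minimization implementation), maximizing expected value while paying a KL penalty toward an imitation-learned anchor policy has a simple closed form over the legal moves. In the sketch below, `q_values` stands in for the value estimates a search would produce and `imitation_probs` for the imitation-learned (human) policy; both names and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def kl_regularized_policy(q_values, imitation_probs, lam):
    """Policy maximizing E_pi[Q] - lam * KL(pi || imitation_probs).

    Closed form: pi(a) is proportional to imitation_probs[a] * exp(Q(a) / lam).
    Large lam keeps the policy close to the human-like imitation policy;
    small lam acts almost greedily on the search values.
    """
    logits = np.log(np.asarray(imitation_probs) + 1e-12) + np.asarray(q_values) / lam
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: three legal moves; search values favor move 2,
# while the imitation (human) policy favors move 0.
q = [0.1, 0.0, 0.4]
tau = np.array([0.7, 0.2, 0.1])
print(kl_regularized_policy(q, tau, lam=0.5))   # a blend of both preferences
print(kl_regularized_policy(q, tau, lam=5.0))   # stays close to tau
```

The single weight `lam` is the knob that trades playing strength against human prediction accuracy: large values keep the search policy near the imitation prior, small values defer almost entirely to the search values.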
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
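
As a minimal, generic sketch of the learn-stochastic/deploy-deterministic recipe (a plain REINFORCE update on an invented one-dimensional task, not the paper's algorithm or its hyperpolicy formulation), the exploration level `sigma` is used only during learning and dropped at deployment:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Toy one-dimensional task whose optimal action is a = 2.
    return -(a - 2.0) ** 2

mean, sigma, lr = 0.0, 1.0, 0.05    # sigma sets the exploration level during learning
for _ in range(500):
    a = mean + sigma * rng.standard_normal(32)           # stochastic (exploring) policy
    # REINFORCE estimate of d E[reward] / d mean for a Gaussian policy.
    grad = np.mean(reward(a) * (a - mean) / sigma**2)
    mean += lr * grad

# Deployment: drop the exploration noise and act with the deterministic version.
print("learned mean (should be close to 2):", round(mean, 2))
print("reward of the deployed deterministic action:", round(reward(mean), 3))
```

In this toy setting, raising `sigma` speeds up exploration but widens the gap between the stochastic policy used for learning and the deterministic policy that is finally deployed, which is the trade-off the paper analyzes.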
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
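
As a hedged sketch of the basic ingredient, the code below fits a reward vector to boundedly rational human choices under a static logit (Luce) choice model by maximum likelihood; the paper's setting is the harder offline, dynamic-choice case with a pessimistic estimator, so everything here (the option set, the choice model, the numbers) is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: 4 discrete options with hidden "true" rewards.
true_r = np.array([0.0, 1.0, 2.0, 0.5])

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Simulate boundedly rational human choices: option a is picked with
# probability proportional to exp(true_r[a]) (a logit choice model).
choices = rng.choice(4, size=500, p=softmax(true_r))
counts = np.bincount(choices, minlength=4)

# Maximum-likelihood estimate of the reward vector under the same model.
r_hat = np.zeros(4)
lr = 0.1
for _ in range(2000):
    # Gradient of the average log-likelihood: observed frequencies minus model probs.
    r_hat += lr * (counts / len(choices) - softmax(r_hat))

print("true relative rewards:     ", true_r - true_r[0])
print("estimated relative rewards:", np.round(r_hat - r_hat[0], 2))
```

The reward is only identified up to an additive constant, hence the comparison of differences; recovering such a reward (and the optimal policy) from limited, off-policy human data with pessimism is the paper's contribution, which this sketch does not attempt.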
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Imitating Opponent to Win: Adversarial Policy Imitation Learning in Two-player Competitive Games [0.0]
Adversarial policies adopted by an adversary agent can cause a target RL agent to perform poorly in a multi-agent environment.
In existing studies, adversarial policies are directly trained based on experiences of interacting with the victim agent.
We design a new effective adversarial policy learning algorithm that overcomes this shortcoming.
arXiv Detail & Related papers (2022-10-30T18:32:02Z)
- Human-AI Coordination via Human-Regularized Search and Learning [33.95649252941375]
We develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark.
We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels.
We show that our method beats a vanilla best-response-to-behavioral-cloning baseline in an evaluation where experts play repeatedly with both agents.
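
A hedged sketch of the anchored regret-minimization idea behind human-regularized search (a generic KL-anchored Hedge/FTRL update in an invented matrix game, not this paper's or the main paper's exact algorithm): each player plays the policy that maximizes cumulative payoff minus a KL penalty toward a hypothetical human-like anchor `tau`.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

tau = np.array([0.6, 0.3, 0.1])   # hypothetical "human-like" anchor policy (assumed)
lam = 1.0                         # anchor strength: larger keeps play closer to tau
T = 5000

def anchored_policy(cum_value, anchor, lam, t):
    # FTRL step: argmax_pi <pi, cum_value> - lam * t * KL(pi || anchor)
    # has the closed form pi(a) proportional to anchor(a) * exp(cum_value(a) / (lam * t)).
    logits = np.log(anchor) + cum_value / (lam * t)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

S1, S2 = np.zeros(3), np.zeros(3)   # cumulative expected payoff of each pure action
avg1 = np.zeros(3)
for t in range(1, T + 1):
    p1 = anchored_policy(S1, tau, lam, t)
    p2 = anchored_policy(S2, tau, lam, t)
    S1 += A @ p2                    # row actions' payoffs against the column mix
    S2 += -(A.T @ p1)               # zero-sum: the column player's payoffs are negated
    avg1 += p1

# Large lam keeps average play near tau; as lam shrinks it moves toward the uniform Nash.
print("average anchored policy:", np.round(avg1 / T, 3))
```

The anchor plays the role of the behavioral-cloning human model: regret minimization improves on it while the KL term keeps the resulting strategy recognizable to the human partner.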
arXiv Detail & Related papers (2022-10-11T03:46:12Z)
- Synthesizing Policies That Account For Human Execution Errors Caused By State Aliasing In Markov Decision Processes [15.450115485745767]
An optimal MDP policy that is poorly executed (because of a human agent) may be much worse than another policy that is executed with fewer errors.
We present a framework to model the likelihood of policy execution errors and the likelihood of non-policy actions like inaction (delays) due to state uncertainty.
We then use hill climbing with a branch-and-bound algorithm to search for the optimal policy under this model.
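
Below is a small sketch of that core loop under stated assumptions: deterministic policies are evaluated on a random toy MDP with a crude execution-error model (a uniform random slip with probability `eps`, standing in for the paper's state-aliasing error likelihoods) and improved by plain hill climbing; the branch-and-bound component is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 5, 3, 0.9, 0.25   # eps: chance of a mis-executed action

# Random toy MDP (a stand-in for the paper's benchmark domains).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(nS, nA))

def value_with_errors(policy):
    """Start-state value of a deterministic policy when, with probability eps,
    the human executor takes a uniformly random action instead."""
    pi = np.full((nS, nA), eps / nA)
    pi[np.arange(nS), policy] += 1.0 - eps
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return v[0]

# Hill climbing over deterministic policies, scoring each candidate
# with the execution-error model rather than the error-free MDP.
policy = rng.integers(nA, size=nS)
best, improved = value_with_errors(policy), True
while improved:
    improved = False
    for s in range(nS):
        for a in range(nA):
            cand = policy.copy(); cand[s] = a
            v = value_with_errors(cand)
            if v > best + 1e-10:
                policy, best, improved = cand, v, True
print("policy found:", policy, "value under execution errors:", round(best, 3))
```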
arXiv Detail & Related papers (2021-09-15T17:10:46Z)
- Goal-Conditioned Reinforcement Learning with Imagined Subgoals [89.67840168694259]
We propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks.
Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic.
We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
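
A toy, learning-free sketch of the hierarchical decomposition only (both functions are hand-coded stand-ins for the learned high-level subgoal predictor and the goal-conditioned low-level policy; there is no training or critic here):

```python
import numpy as np

def high_level_subgoal(state, goal):
    # Stand-in for the learned high-level policy: imagine a midpoint subgoal.
    return 0.5 * (state + goal)

def low_level_action(state, subgoal, max_step=0.3):
    # Stand-in for the goal-conditioned low-level policy: step toward the subgoal.
    return np.clip(subgoal - state, -max_step, max_step)

# Toy 1-D navigation: the agent reaches a distant goal via imagined subgoals.
state, goal = 0.0, 5.0
for _ in range(40):
    subgoal = high_level_subgoal(state, goal)   # imagined intermediate target
    state += low_level_action(state, subgoal)   # act toward the subgoal
print("final distance to goal:", round(abs(goal - state), 3))
```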
arXiv Detail & Related papers (2021-07-01T15:30:59Z)
- Imitation Learning from MPC for Quadrupedal Multi-Gait Control [63.617157490920505]
We present a learning algorithm for training a single policy that imitates multiple gaits of a walking robot.
We use and extend MPC-Net, which is an Imitation Learning approach guided by Model Predictive Control.
We validate our approach on hardware and show that a single learned policy can replace its teacher to control multiple gaits.
arXiv Detail & Related papers (2021-03-26T08:48:53Z)
- School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget [4.726777092009554]
Pommerman is a hybrid cooperative/adversarial multi-agent environment.
This makes it a challenging environment for reinforcement learning approaches.
We develop a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games.
arXiv Detail & Related papers (2021-02-23T15:43:09Z)
- Policy Supervectors: General Characterization of Agents by their Behaviour [18.488655590845163]
We propose policy supervectors for characterizing agents by the distribution of states they visit.
Policy supervectors can characterize policies regardless of their design philosophy and scale to thousands of policies on a single workstation.
We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training, and imitation learning.
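
A simplified sketch of the idea under assumptions: the paper fits Gaussian mixture models to visited states, while here a plain histogram over an invented 1-D state space stands in, and `policy_bias` is a made-up proxy for policies with different behaviors.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_states(policy_bias, n_steps=2000):
    """Toy 1-D environment; policy_bias is an invented stand-in for a
    policy's behavioral tendency (drift direction)."""
    s, states = 0.0, []
    for _ in range(n_steps):
        s = np.clip(s + policy_bias + 0.5 * rng.standard_normal(), -10, 10)
        states.append(s)
    return np.array(states)

def supervector(states, bins=np.linspace(-10, 10, 21)):
    """Simplified 'policy supervector': a normalized histogram of visited states."""
    h, _ = np.histogram(states, bins=bins)
    return h / h.sum()

# Characterize three hypothetical policies by where they spend their time,
# then compare behaviors with a simple distance between supervectors.
vecs = {b: supervector(rollout_states(b)) for b in (-0.2, 0.0, 0.2)}
for b1 in vecs:
    for b2 in vecs:
        if b1 < b2:
            d = np.linalg.norm(vecs[b1] - vecs[b2])
            print(f"distance between policies {b1:+.1f} and {b2:+.1f}: {d:.3f}")
```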
arXiv Detail & Related papers (2020-12-02T14:43:16Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
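
A tabular toy sketch of the reward-conditioned idea under assumptions: treat the obtained reward as an extra input and learn which action goes with which outcome purely by supervised lookup over non-expert data (the paper trains neural reward-conditioned policies on full RL tasks; the bandit, numbers, and nearest-neighbor rule here are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit: three actions with different mean rewards.
means = np.array([0.2, 0.5, 0.8])

# 1) Collect experience with a random (non-expert) behavior policy.
actions = rng.integers(3, size=5000)
rewards = means[actions] + 0.1 * rng.standard_normal(5000)

# 2) "Reward-conditioned policy" via supervised lookup: for a target reward,
#    return the action that most often produced similar rewards in the data.
def reward_conditioned_action(target_reward, k=50):
    idx = np.argsort(np.abs(rewards - target_reward))[:k]   # k closest outcomes
    return np.bincount(actions[idx], minlength=3).argmax()

# 3) Conditioning on a high target reward yields good behavior with no expert data.
print("action when asked for reward 0.8:", reward_conditioned_action(0.8))  # likely 2
print("action when asked for reward 0.2:", reward_conditioned_action(0.2))  # likely 0
```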
arXiv Detail & Related papers (2019-12-31T18:07:43Z)