Learnable Behavior Control: Breaking Atari Human World Records via
Sample-Efficient Behavior Selection
- URL: http://arxiv.org/abs/2305.05239v1
- Date: Tue, 9 May 2023 08:00:23 GMT
- Title: Learnable Behavior Control: Breaking Atari Human World Records via
Sample-Efficient Behavior Selection
- Authors: Jiajun Fan, Yuzheng Zhuang, Yuecheng Liu, Jianye Hao, Bin Wang,
Jiangcheng Zhu, Hao Wang, Shu-Tao Xia
- Abstract summary: We propose a general framework called Learnable Behavioral Control (LBC) to address the limitation.
Our agents have achieved 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames.
- Score: 56.87650511573298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exploration problem is one of the main challenges in deep reinforcement
learning (RL). Recent promising works tried to handle the problem with
population-based methods, which collect samples with diverse behaviors derived
from a population of different exploratory policies. Adaptive policy selection
has been adopted for behavior control. However, the behavior selection space is
largely limited by the predefined policy population, which further limits
behavior diversity. In this paper, we propose a general framework called
Learnable Behavioral Control (LBC) to address this limitation, which (a) enables
a significantly enlarged behavior selection space by formulating a hybrid
behavior mapping from all policies, and (b) constructs a unified learnable
process for behavior selection. We introduce LBC into distributed off-policy
actor-critic methods and achieve behavior control via optimizing the selection
of the behavior mappings with bandit-based meta-controllers. Our agents have
achieved 10077.52% mean human normalized score and surpassed 24 human world
records within 1B training frames in the Arcade Learning Environment,
demonstrating significant state-of-the-art (SOTA) performance without
degrading sample efficiency.
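The bandit-based meta-control described in the abstract can be sketched minimally. The candidate-mapping set, the UCB1 selection rule, and the use of episodic return as the bandit reward below are illustrative assumptions for the sketch, not the paper's exact algorithm:

```python
import math

class UCBMetaController:
    """Minimal UCB1 bandit sketch for choosing among candidate behavior
    mappings, in the spirit of LBC's bandit-based meta-controller.
    The mapping set, reward signal, and UCB1 rule are illustrative
    assumptions, not the paper's exact algorithm."""

    def __init__(self, num_mappings, exploration_coef=2.0):
        self.counts = [0] * num_mappings
        self.values = [0.0] * num_mappings
        self.total = 0
        self.c = exploration_coef

    def select(self):
        # Play each behavior mapping once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # UCB1: mean value plus an optimism bonus that shrinks with visits.
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(self.c * math.log(self.total) / self.counts[i]),
        )

    def update(self, arm, reward):
        # Incremental mean update for the chosen behavior mapping,
        # using e.g. the episodic return produced under that mapping.
        self.total += 1
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a distributed actor-critic setup, each actor would query `select()` to pick a behavior mapping for its next episode and report the resulting return via `update()`.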
Related papers
- How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation [17.638831964639834]
Behavior cloning policies are increasingly successful at solving complex tasks by learning from human demonstrations.
We present a framework that provides a tight lower-bound on robot performance in an arbitrary environment.
In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware.
arXiv Detail & Related papers (2024-05-08T22:00:35Z)
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, a guaranteed near-optimal last-iterate policy, and guaranteed model accuracy.
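The UCB principle this entry builds on can be illustrated with a generic count-based confidence bonus; the paper's actual bonus bounds the total variation distance between the estimated and true models, so the classic `sqrt(log t / n)` form below is a simplified stand-in:

```python
import math

def ucb_score(empirical_value, visit_count, total_steps, scale=1.0):
    """Generic UCB score: empirical estimate plus an optimism bonus that
    shrinks as the visit count grows. The sqrt(log t / n) bonus is the
    classic count-based form, a simplified stand-in for the paper's
    model-based bonus on the total variation distance."""
    bonus = scale * math.sqrt(
        math.log(max(total_steps, 2)) / max(visit_count, 1)
    )
    return empirical_value + bonus
```

The key property is that, for a fixed empirical value, less-visited options score higher, which drives systematic exploration.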
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
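Preference-based RL methods of this kind commonly fit a reward model to pairwise labels with a Bradley-Terry style objective; the sketch below shows that standard formulation, which is not necessarily this paper's exact crowd-sourced objective:

```python
import math

def preference_loss(return_a, return_b, label):
    """Bradley-Terry style loss widely used in preference-based RL:
    the probability that segment A is preferred over segment B is a
    logistic function of their predicted returns. `label` is 1.0 if A
    was preferred, 0.0 if B was. Standard formulation, shown here as an
    illustration rather than this paper's exact objective."""
    p_a = 1.0 / (1.0 + math.exp(-(return_a - return_b)))
    # Clamp to avoid log(0) in degenerate cases.
    p_a = min(max(p_a, 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p_a) + (1.0 - label) * math.log(1.0 - p_a))
```

Minimizing this loss over labeled segment pairs pushes the predicted return of preferred segments above that of rejected ones.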
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
- ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning [48.033516430071494]
We introduce a modified version of behavioral cloning (BC) that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training.
We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.
arXiv Detail & Related papers (2022-11-08T04:54:54Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Externally Valid Policy Choice [0.0]
We consider the problem of learning personalized treatment policies that are externally valid or generalizable.
We first show that welfare-maximizing policies for the experimental population are robust to shifts in the distribution of outcomes.
We then develop new methods for learning policies that are robust to shifts in outcomes and characteristics.
arXiv Detail & Related papers (2022-05-11T15:19:22Z) - Learning Complex Spatial Behaviours in ABM: An Experimental
Observational Study [0.0]
This paper explores how Reinforcement Learning can be applied to create emergent agent behaviours.
Running a series of simulations, we demonstrate that agents trained using the Proximal Policy Optimisation (PPO) algorithm behave in ways that exhibit properties of real-world intelligent adaptive behaviours.
arXiv Detail & Related papers (2022-01-04T11:56:11Z) - Improving Generalization in Reinforcement Learning with Mixture
Regularization [113.12412071717078]
We introduce a simple approach, named mixreg, which trains agents on a mixture of observations from different training environments.
Mixreg increases the data diversity more effectively and helps learn smoother policies.
Results show mixreg outperforms the well-established baselines on unseen testing environments by a large margin.
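The mixture described in this entry amounts to convexly combining observations (and associated targets) drawn from different training environments. The sketch below assumes a Beta-distributed mixing coefficient and reward interpolation; the paper's exact coefficients and targets may differ:

```python
import random

def mixreg(obs_a, rew_a, obs_b, rew_b, alpha=0.2):
    """mixreg-style mixture: convexly combine two observations (and
    their rewards) from different training environments with a
    Beta-distributed coefficient, as a data-diversity regularizer.
    Coefficient distribution and interpolated targets are assumptions
    for this sketch."""
    lam = random.betavariate(alpha, alpha)
    mixed_obs = [lam * a + (1.0 - lam) * b for a, b in zip(obs_a, obs_b)]
    mixed_rew = lam * rew_a + (1.0 - lam) * rew_b
    return mixed_obs, mixed_rew
```

Because the combination is convex, every mixed feature and reward stays within the range spanned by the two source samples.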
arXiv Detail & Related papers (2020-10-21T08:12:03Z)
- Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL [28.38826379640553]
We propose a more general and flexible parametric framework for sequential decision making.
Inspired by the known reward processing abnormalities of many mental disorders, our clinically-inspired agents demonstrated interesting behavioral trajectories.
arXiv Detail & Related papers (2020-05-10T01:43:39Z)
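One way such clinically-inspired agents can be parameterized is by splitting value learning into separate positive- and negative-reward streams and recombining them with per-agent weights; the class below is an illustrative sketch of that idea, with hypothetical parameter names and update rule:

```python
class SplitQAgent:
    """Sketch of a split value learner: positive and negative rewards
    are tracked in separate streams and recombined with per-agent
    weights, one way to model atypical reward processing. Parameter
    names and the update rule are illustrative assumptions."""

    def __init__(self, num_actions, w_pos=1.0, w_neg=1.0, lr=0.1):
        self.q_pos = [0.0] * num_actions  # learned value of positive outcomes
        self.q_neg = [0.0] * num_actions  # learned cost of negative outcomes
        self.w_pos, self.w_neg, self.lr = w_pos, w_neg, lr

    def value(self, action):
        # Behavior is driven by a weighted recombination of both streams;
        # e.g. w_neg > 1 models heightened sensitivity to punishment.
        return self.w_pos * self.q_pos[action] - self.w_neg * self.q_neg[action]

    def update(self, action, reward):
        # Route the outcome to the matching stream and move its estimate
        # toward the observed magnitude.
        if reward >= 0.0:
            self.q_pos[action] += self.lr * (reward - self.q_pos[action])
        else:
            self.q_neg[action] += self.lr * (-reward - self.q_neg[action])
```

Varying `w_pos` and `w_neg` yields the kind of distinct behavioral trajectories the entry describes, e.g. avoidance-dominated behavior when `w_neg` is large.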
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.