Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection
- URL: http://arxiv.org/abs/2305.05239v2
- Date: Mon, 27 Oct 2025 04:29:26 GMT
- Title: Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection
- Authors: Jiajun Fan, Yuzheng Zhuang, Yuecheng Liu, Jianye Hao, Bin Wang, Jiangcheng Zhu, Hao Wang, Shu-Tao Xia,
- Abstract summary: We propose a general framework called Learnable Behavioral Control (LBC) to address the limitation. Our agents have achieved a 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames.
- Score: 80.35510218548693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works have tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which in turn limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address this limitation, which a) enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies; and b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control by optimizing the selection of behavior mappings with bandit-based meta-controllers. Our agents have achieved a 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating significant state-of-the-art (SOTA) performance without degrading sample efficiency.
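The abstract describes optimizing behavior-mapping selection with a bandit-based meta-controller. As a minimal illustrative sketch (not the paper's implementation), a UCB1 bandit can treat each candidate behavior mapping as an arm and use episodic return as the bandit reward; the class and method names here are hypothetical:

```python
import math

class UCB1MetaController:
    """Hypothetical bandit meta-controller: each arm is one candidate
    behavior mapping (e.g. one weighting over the policy population)."""

    def __init__(self, num_mappings):
        self.counts = [0] * num_mappings
        self.values = [0.0] * num_mappings  # running mean of episodic returns
        self.total = 0

    def select(self):
        # Pull each arm once first, then pick by the UCB1 index:
        # mean return + sqrt(2 ln T / n_i).
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(2 * math.log(self.total) / self.counts[i]),
        )

    def update(self, arm, episodic_return):
        # Incremental mean update with the episode's return as bandit reward.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (episodic_return - self.values[arm]) / self.counts[arm]
```

In this sketch, the actor workers would query `select()` for a behavior mapping at the start of each episode and report the resulting return via `update()`, so the meta-controller gradually concentrates on the mappings that yield the highest returns.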
Related papers
- OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization [50.11607985532808]
We introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. Using HARPO, we develop and release OmniSapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, OmniSapiens-7B 2.0 achieves the strongest performance across behavioral tasks.
arXiv Detail & Related papers (2026-02-11T08:35:59Z) - Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control [1.7495213911983414]
We introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via sampling tricks. Our results indicate that the categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
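The "sampling tricks" for differentiable categorical sampling are not specified in this summary; one standard option is the Gumbel-softmax relaxation, sketched below under that assumption (the function name and temperature value are illustrative):

```python
import math
import random

def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Illustrative relaxed sample from a categorical distribution.

    Adding Gumbel noise -log(-log(U)) to the logits and applying a
    temperature-scaled softmax yields a relaxed one-hot vector; its
    argmax follows the categorical distribution softmax(logits)
    (the Gumbel-max trick), while the output stays differentiable
    with respect to the logits.
    """
    noisy = [
        (logit - math.log(-math.log(rng.uniform(1e-12, 1.0)))) / temperature
        for logit in logits
    ]
    m = max(noisy)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]         # relaxed one-hot over behavior modes
```

Lower temperatures push the output closer to a hard one-hot selection of a single behavior mode; higher temperatures blend the modes more smoothly.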
arXiv Detail & Related papers (2025-08-19T15:18:01Z) - PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning [2.0373030742807545]
We identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape. This diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors.
arXiv Detail & Related papers (2025-06-16T17:51:33Z) - Offline Learning of Controllable Diverse Behaviors [19.0544729496907]
Imitation Learning (IL) techniques aim to replicate human behaviors in specific tasks.
We propose a new method based on temporal consistency and controllability.
We compare our approach to state-of-the-art methods over a diverse set of tasks and environments.
arXiv Detail & Related papers (2025-04-25T08:16:56Z) - Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models [71.34520793462069]
Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments.
We introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets.
We demonstrate the effectiveness of this new approach in a challenging humanoid control problem.
arXiv Detail & Related papers (2025-04-15T10:41:11Z) - Diversifying Policy Behaviors with Extrinsic Behavioral Curiosity [27.272921087408164]
This work introduces Quality Diversity Inverse Reinforcement Learning (QD-IRL) and Extrinsic Behavioral Curiosity (EBC). QD-IRL integrates quality-diversity optimization with IRL methods, enabling agents to learn diverse behaviors from limited demonstrations. EBC allows agents to receive additional curiosity rewards from an external critic based on how novel their behaviors are.
arXiv Detail & Related papers (2024-10-08T15:49:33Z) - How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation [17.638831964639834]
Behavior cloning policies are increasingly successful at solving complex tasks by learning from human demonstrations.
We present a framework that provides a tight lower-bound on robot performance in an arbitrary environment.
In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware.
arXiv Detail & Related papers (2024-05-08T22:00:35Z) - Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning [48.033516430071494]
We introduce a modified version of behavioral cloning (BC) that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training.
We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.
arXiv Detail & Related papers (2022-11-08T04:54:54Z) - CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z) - Externally Valid Policy Choice [0.0]
We consider the problem of learning personalized treatment policies that are externally valid or generalizable.
We first show that welfare-maximizing policies for the experimental population are robust to shifts in the distribution of outcomes.
We then develop new methods for learning policies that are robust to shifts in outcomes and characteristics.
arXiv Detail & Related papers (2022-05-11T15:19:22Z) - Learning Complex Spatial Behaviours in ABM: An Experimental Observational Study [0.0]
This paper explores how Reinforcement Learning can be applied to create emergent agent behaviours.
Running a series of simulations, we demonstrate that agents trained using the Proximal Policy Optimisation (PPO) algorithm behave in ways that exhibit properties of real-world intelligent adaptive behaviours.
arXiv Detail & Related papers (2022-01-04T11:56:11Z) - Improving Generalization in Reinforcement Learning with Mixture Regularization [113.12412071717078]
We introduce a simple approach, named mixreg, which trains agents on a mixture of observations from different training environments.
Mixreg increases the data diversity more effectively and helps learn smoother policies.
Results show mixreg outperforms the well-established baselines on unseen testing environments by a large margin.
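The summary describes mixreg as training on a mixture of observations from different environments. A minimal sketch of that mixing step, assuming a mixup-style convex combination with a Beta-distributed coefficient (the function signature and default `alpha` are illustrative, not taken from the paper):

```python
import random

def mixreg_pair(obs_a, obs_b, reward_a, reward_b, alpha=0.2, rng=random):
    """Illustrative mixture regularization: convexly combine two
    observations drawn from different training environments, and mix
    their associated supervision signals (here, rewards) with the
    same coefficient lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_obs = [lam * a + (1 - lam) * b for a, b in zip(obs_a, obs_b)]
    mixed_reward = lam * reward_a + (1 - lam) * reward_b
    return mixed_obs, mixed_reward, lam
```

A small `alpha` concentrates lambda near 0 or 1, so most mixed samples stay close to one of the two source environments while still interpolating between them occasionally.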
arXiv Detail & Related papers (2020-10-21T08:12:03Z) - Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL [28.38826379640553]
We propose a more general and flexible parametric framework for sequential decision making.
Inspired by the known reward processing abnormalities of many mental disorders, our clinically-inspired agents demonstrated interesting behavioral trajectories.
arXiv Detail & Related papers (2020-05-10T01:43:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.