Generalized Data Distribution Iteration
- URL: http://arxiv.org/abs/2206.03192v1
- Date: Tue, 7 Jun 2022 11:27:40 GMT
- Title: Generalized Data Distribution Iteration
- Authors: Jiajun Fan, Changnan Xiao
- Abstract summary: We tackle data richness and exploration-exploitation trade-off simultaneously in deep reinforcement learning.
We introduce operator-based versions of well-known RL methods from DQN to Agent57.
Our algorithm has achieved 9620.33% mean human normalized score (HNS), 1146.39% median HNS and surpassed 22 human world records using only 200M training frames.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Obtaining higher sample efficiency and superior final performance simultaneously has been one of the major challenges for deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address both concurrently. In this paper, we tackle these two challenges simultaneously. To achieve this, we first decouple these
challenges into two classic RL problems: data richness and
exploration-exploitation trade-off. Then, we cast these two problems into the
training data distribution optimization problem, namely to obtain desired
training data within limited interactions, and address them concurrently via i)
explicit modeling and control of the capacity and diversity of behavior policy
and ii) more fine-grained and adaptive control of selective/sampling
distribution of the behavior policy using a monotonic data distribution
optimization. Finally, we integrate this process into Generalized Policy
Iteration (GPI) and obtain a more general framework called Generalized Data
Distribution Iteration (GDI). We use the GDI framework to introduce
operator-based versions of well-known RL methods from DQN to Agent57.
We provide a theoretical guarantee of the superiority of GDI over GPI.
We also demonstrate our state-of-the-art (SOTA) performance on Arcade Learning
Environment (ALE), wherein our algorithm has achieved 9620.33% mean human
normalized score (HNS), 1146.39% median HNS and surpassed 22 human world
records using only 200M training frames. Our performance is comparable to
Agent57's while consuming 500 times less data. We argue that there is still a long way to go before obtaining truly superhuman agents in ALE.
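To make the framework description above concrete, the following is a minimal, hypothetical sketch of a GDI-style training loop. It assumes a parameterized family of behavior policies (modeling the "capacity and diversity" of the behavior policy), a selective/sampling distribution over that family, and a simple softmax ascent as the monotone data-distribution update; all interfaces and constants here (behavior_family, policy.collect, learner.evaluate, the 0.1 step size) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gdi_loop(env, learner, behavior_family, num_iterations, frames_per_iter):
    """Illustrative GDI-style loop: GPI plus data-distribution optimization."""
    # Selective/sampling distribution over the behavior-policy family,
    # initialized uniformly.
    sampling_dist = np.ones(len(behavior_family)) / len(behavior_family)

    for _ in range(num_iterations):
        # 1) Data collection: each behavior policy receives a share of the
        #    interaction budget proportional to the sampling distribution.
        batches = []
        returns = np.zeros(len(behavior_family))
        for k, policy in enumerate(behavior_family):
            data = policy.collect(env, int(sampling_dist[k] * frames_per_iter))
            batches.append(data)
            returns[k] = data.mean_return

        # 2) Data-distribution optimization: shift the sampling distribution
        #    toward behavior policies whose data yielded higher returns
        #    (a softmax ascent step, assumed here purely for illustration).
        sampling_dist = softmax(np.log(sampling_dist + 1e-8) + 0.1 * returns)

        # 3) Ordinary GPI inner step on the collected data:
        #    policy evaluation followed by policy improvement.
        for data in batches:
            learner.evaluate(data)
        learner.improve()

        # Refresh the behavior family from the updated learner so its
        # capacity/diversity tracks the improved policy.
        behavior_family.sync(learner)

    return learner
```

The point of the sketch is step 2: unlike plain GPI, the distribution that generates the training data is itself optimized at every iteration rather than being fixed or hand-tuned.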
Related papers
- Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining [49.730897226510095]
We introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on Atari games with 6 billion tokens of data.
Our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average.
arXiv Detail & Related papers (2024-10-01T10:25:03Z)
- SMaRt: Improving GANs with Score Matching Regularity [94.81046452865583]
Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex.
We show that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold.
We propose to improve the optimization of GANs with score matching regularity (SMaRt)
arXiv Detail & Related papers (2023-11-30T03:05:14Z)
- Improving Generalization of Alignment with Human Preferences through Group Invariant Learning [56.19242260613749]
Reinforcement Learning from Human Feedback (RLHF) enables the generation of responses more aligned with human preferences.
Previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.
We propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
arXiv Detail & Related papers (2023-10-18T13:54:15Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment.
We propose a new method that uses unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resilience to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- Revisiting Gaussian mixture critics in off-policy reinforcement learning: a sample-based approach [28.199348547856175]
This paper revisits a natural alternative that removes the requirement of prior knowledge about the minimum and maximum values a policy can attain.
It achieves state-of-the-art performance on a variety of challenging tasks.
arXiv Detail & Related papers (2022-04-21T16:44:47Z)
- GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving [9.030769176986057]
General Reinforced Imitation (GRI) is a novel method which combines benefits from exploration and expert data.
We show that our approach enables major improvements on vision-based autonomous driving in urban environments.
arXiv Detail & Related papers (2021-11-16T15:52:54Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE)
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning [8.755783981297396]
We extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more generalized version called Generalized Data Distribution Iteration (GDI).
Our algorithm has achieved 9620.98% mean human normalized score (HNS), 1146.39% median HNS and 22 human world record breakthroughs (HWRB) using only 200M training frames.
arXiv Detail & Related papers (2021-06-11T08:31:12Z)
- Regularizing Generative Adversarial Networks under Limited Data [88.57330330305535]
This work proposes a regularization approach for training robust GAN models on limited data.
We show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.
arXiv Detail & Related papers (2021-04-07T17:59:06Z)