Replay-Guided Adversarial Environment Design
- URL: http://arxiv.org/abs/2110.02439v1
- Date: Wed, 6 Oct 2021 01:01:39 GMT
- Title: Replay-Guided Adversarial Environment Design
- Authors: Minqi Jiang, Michael Dennis, Jack Parker-Holder, Jakob Foerster,
Edward Grefenstette, Tim Rocktäschel
- Abstract summary: We argue that by curating completely random levels, PLR can generate novel and complex levels for effective training.
We show that our new method, PLR$^{\perp}$, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks.
- Score: 21.305857977725886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning (RL) agents may successfully generalize to new
settings if trained on an appropriately diverse set of environment and task
configurations. Unsupervised Environment Design (UED) is a promising
self-supervised RL paradigm, wherein the free parameters of an underspecified
environment are automatically adapted during training to the agent's
capabilities, leading to the emergence of diverse training environments. Here,
we cast Prioritized Level Replay (PLR), an empirically successful but
theoretically unmotivated method that selectively samples randomly-generated
training levels, as UED. We argue that by curating completely random levels,
PLR, too, can generate novel and complex levels for effective training. This
insight reveals a natural class of UED methods we call Dual Curriculum Design
(DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as
special cases and inherits similar theoretical guarantees. This connection
allows us to develop novel theory for PLR, providing a version with a
robustness guarantee at Nash equilibria. Furthermore, our theory suggests a
highly counterintuitive improvement to PLR: by stopping the agent from updating
its policy on uncurated levels (training on less data), we can improve the
convergence to Nash equilibria. Indeed, our experiments confirm that our new
method, PLR$^{\perp}$, obtains better results on a suite of
out-of-distribution, zero-shot transfer tasks, in addition to demonstrating
that PLR$^{\perp}$ improves the performance of PAIRED, from which it inherited
its theoretical framework.
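The PLR$^{\perp}$ loop is compact enough to sketch in code. Below is a minimal, illustrative Python rendering of the mechanism the abstract describes: newly generated random levels are evaluated and scored for the replay buffer, but gradient updates are taken only on replayed, curated levels. All names (`DummyAgent`, `LevelBuffer`, `score_level`, `replay_prob`) are hypothetical stand-ins, not the authors' implementation; in the paper, levels are scored with regret estimates such as the positive value loss.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """Stand-in for collected experience; carries only the replay score."""
    mean_abs_gae: float

class DummyAgent:
    """Hypothetical agent exposing the two hooks the loop needs."""
    def collect(self, level):
        return Rollout(mean_abs_gae=random.random())
    def ppo_update(self, rollout):
        pass  # a real agent would take a PPO gradient step here

def sample_random_level():
    """Stub: draw the free parameters of the underspecified environment."""
    return random.random()

def score_level(rollout):
    """Stub regret estimate, e.g. PLR's L1 value loss."""
    return rollout.mean_abs_gae

class LevelBuffer:
    """Toy curated-level buffer with rank-based prioritized sampling."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.scores = {}  # level -> latest replay score

    def add(self, level, score):
        self.scores[level] = score
        if len(self.scores) > self.capacity:
            # Evict the lowest-scoring level.
            del self.scores[min(self.scores, key=self.scores.get)]

    def sample(self):
        # Higher-scoring (higher-regret) levels are replayed more often.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        weights = [1.0 / (rank + 1) for rank in range(len(ranked))]
        return random.choices(ranked, weights=weights, k=1)[0]

def train_plr_perp(agent, num_iters=1000, replay_prob=0.5):
    buffer = LevelBuffer()
    for _ in range(num_iters):
        if buffer.scores and random.random() < replay_prob:
            # Replay branch: the ONLY place gradient updates happen.
            level = buffer.sample()
            rollout = agent.collect(level)
            agent.ppo_update(rollout)
            buffer.add(level, score_level(rollout))  # refresh its score
        else:
            # Exploration branch: score a brand-new random level for the
            # buffer, but take NO policy update on it (the PLR-perp twist).
            level = sample_random_level()
            rollout = agent.collect(level)
            buffer.add(level, score_level(rollout))
    return agent

train_plr_perp(DummyAgent())
```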
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
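To make "regressing relative rewards" concrete: REBEL's per-iteration update can be sketched as a least-squares regression of reward differences onto differences of policy log-ratios (a paraphrase, not a quote from the paper; the exact notation and conditioning may differ):

$$\theta_{t+1} = \arg\min_{\theta}\ \mathbb{E}_{x,\,y,\,y'}\left[\left(\frac{1}{\eta}\left(\ln \frac{\pi_\theta(y \mid x)}{\pi_{\theta_t}(y \mid x)} - \ln \frac{\pi_\theta(y' \mid x)}{\pi_{\theta_t}(y' \mid x)}\right) - \big(r(x, y) - r(x, y')\big)\right)^2\right]$$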
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design [11.922951794283168]
In this work, we investigate how the sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents.
We discover that for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data.
We find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance.
To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED).
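The value-loss prioritisation referenced here is, in PLR-style methods, typically the average magnitude of the GAE advantage over an episode (the "L1 value loss"). A one-function sketch, with hypothetical names:

```python
def l1_value_loss_score(gae_advantages):
    """Mean absolute GAE over an episode: a proxy for how wrong the value
    function is on this level, used as the level's replay priority."""
    return sum(abs(a) for a in gae_advantages) / max(len(gae_advantages), 1)
```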
arXiv Detail & Related papers (2024-02-05T19:47:45Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- PEAR: Primitive enabled Adaptive Relabeling for boosting Hierarchical Reinforcement Learning [25.84621883831624]
Hierarchical reinforcement learning has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration.
We present primitive enabled adaptive relabeling (PEAR).
We first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision.
We then jointly optimize HRL agents by employing reinforcement learning (RL) and imitation learning (IL).
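Schematically, the joint optimization here combines an RL objective with an IL regularizer fit to the relabeled expert subgoals; the form below is a generic sketch, not PEAR's exact loss ($\lambda$ is a hypothetical trade-off weight):

$$\max_{\theta}\ J_{\mathrm{RL}}(\theta) - \lambda\, \mathcal{L}_{\mathrm{IL}}(\theta)$$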
arXiv Detail & Related papers (2023-06-10T09:41:30Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that first acquires exploratory trajectories enabling accurate learning of the hidden reward function.
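For background, pairwise preference feedback in PbRL is standardly modelled with a Bradley-Terry link from the hidden reward $r$ to preference probabilities (a textbook formulation, not necessarily this paper's exact model):

$$P(\tau^1 \succ \tau^2) = \frac{\exp\left(\sum_t r(s_t^1, a_t^1)\right)}{\exp\left(\sum_t r(s_t^1, a_t^1)\right) + \exp\left(\sum_t r(s_t^2, a_t^2)\right)}$$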
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- On Practical Robust Reinforcement Learning: Practical Uncertainty Set and Double-Agent Algorithm [11.748284119769039]
Robust reinforcement learning (RRL) aims to find a robust policy that optimizes worst-case performance over an uncertainty set of Markov decision processes (MDPs).
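The worst-case objective has the standard minimax form (shown generically; the paper's contribution is its particular uncertainty set $\mathcal{U}$ and algorithm):

$$\max_{\pi}\ \min_{M \in \mathcal{U}}\ \mathbb{E}_{M,\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$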
arXiv Detail & Related papers (2023-05-11T08:52:09Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Single-Trajectory Distributionally Robust Reinforcement Learning [13.013268095049236]
Reinforcement Learning (RL) has been regarded as an essential component on the path to Artificial General Intelligence (AGI).
However, RL is often criticized for assuming that the training environment is identical to the test environment, an assumption that hinders its application in the real world.
To mitigate this problem, Distributionally Robust RL (DRRL) has been proposed to improve worst-case performance over a set of environments that may contain the unknown test environment.
arXiv Detail & Related papers (2023-01-27T14:08:09Z)
- Grounding Aleatoric Uncertainty in Unsupervised Environment Design [32.00797965770773]
In partially-observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to curriculum-induced covariate shift (CICS).
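The minimax regret objective shared by PAIRED-style UED methods (and by the Dual Curriculum Design framing of the main paper above) is, with $\theta$ the free environment parameters and $U_\theta(\pi)$ the expected return of $\pi$ on level $\theta$:

$$\min_{\pi}\ \max_{\theta}\ \mathrm{Regret}(\pi, \theta) = \min_{\pi}\ \max_{\theta}\ \big(U_\theta(\pi^*_\theta) - U_\theta(\pi)\big)$$

where $\pi^*_\theta$ is an optimal policy for level $\theta$.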
arXiv Detail & Related papers (2022-07-11T22:45:29Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
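Schematically, likelihood-ratio DRO trains against an adversarially reweighted version of the training distribution $p$ (a generic form; here the ratio $r$ is parameterised by a model rather than solved in closed form):

$$\min_{\theta}\ \max_{r \in \mathcal{R}}\ \mathbb{E}_{x \sim p}\big[r(x)\,\ell(x;\theta)\big] \quad \text{s.t.}\ \ \mathbb{E}_{p}[r] = 1,\ r \ge 0$$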
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Towards Scaling Difference Target Propagation by Learning Backprop Targets [64.90165892557776]
Difference Target Propagation (DTP) is a biologically plausible learning algorithm closely related to Gauss-Newton (GN) optimization.
We propose a novel feedback weight training scheme that ensures both that DTP approximates backpropagation (BP) and that layer-wise feedback weight training can be restored.
We report the best performance ever achieved by DTP on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2022-01-31T18:20:43Z)