Replay-Guided Adversarial Environment Design
- URL: http://arxiv.org/abs/2110.02439v1
- Date: Wed, 6 Oct 2021 01:01:39 GMT
- Title: Replay-Guided Adversarial Environment Design
- Authors: Minqi Jiang, Michael Dennis, Jack Parker-Holder, Jakob Foerster,
Edward Grefenstette, Tim Rocktäschel
- Abstract summary: We argue that by curating completely random levels, PLR can generate novel and complex levels for effective training.
We show that our new method, PLR$^{\perp}$, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks.
- Score: 21.305857977725886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning (RL) agents may successfully generalize to new
settings if trained on an appropriately diverse set of environment and task
configurations. Unsupervised Environment Design (UED) is a promising
self-supervised RL paradigm, wherein the free parameters of an underspecified
environment are automatically adapted during training to the agent's
capabilities, leading to the emergence of diverse training environments. Here,
we cast Prioritized Level Replay (PLR), an empirically successful but
theoretically unmotivated method that selectively samples randomly-generated
training levels, as UED. We argue that by curating completely random levels,
PLR, too, can generate novel and complex levels for effective training. This
insight reveals a natural class of UED methods we call Dual Curriculum Design
(DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as
special cases and inherits similar theoretical guarantees. This connection
allows us to develop novel theory for PLR, providing a version with a
robustness guarantee at Nash equilibria. Furthermore, our theory suggests a
highly counterintuitive improvement to PLR: by stopping the agent from updating
its policy on uncurated levels (training on less data), we can improve the
convergence to Nash equilibria. Indeed, our experiments confirm that our new
method, PLR$^{\perp}$, obtains better results on a suite of
out-of-distribution, zero-shot transfer tasks, in addition to demonstrating
that PLR$^{\perp}$ improves the performance of PAIRED, from which it inherited
its theoretical framework.
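The PLR$^{\perp}$ loop is compact enough to sketch in code. Below is a minimal, illustrative Python rendering of the mechanism the abstract describes: newly generated random levels are evaluated and scored for the replay buffer, but gradient updates are taken only on replayed, curated levels. All names (`DummyAgent`, `LevelBuffer`, `score_level`, `replay_prob`) are hypothetical stand-ins, not the authors' implementation; in the paper, levels are scored with regret estimates such as the positive value loss.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """Stand-in for collected experience; carries only the replay score."""
    mean_abs_gae: float

class DummyAgent:
    """Hypothetical agent exposing the two hooks the loop needs."""
    def collect(self, level):
        return Rollout(mean_abs_gae=random.random())
    def ppo_update(self, rollout):
        pass  # a real agent would take a PPO gradient step here

def sample_random_level():
    """Stub: draw the free parameters of the underspecified environment."""
    return random.random()

def score_level(rollout):
    """Stub regret estimate, e.g. PLR's L1 value loss."""
    return rollout.mean_abs_gae

class LevelBuffer:
    """Toy curated-level buffer with rank-based prioritized sampling."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.scores = {}  # level -> latest replay score

    def add(self, level, score):
        self.scores[level] = score
        if len(self.scores) > self.capacity:
            # Evict the lowest-scoring level.
            del self.scores[min(self.scores, key=self.scores.get)]

    def sample(self):
        # Higher-scoring (higher-regret) levels are replayed more often.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        weights = [1.0 / (rank + 1) for rank in range(len(ranked))]
        return random.choices(ranked, weights=weights, k=1)[0]

def train_plr_perp(agent, num_iters=1000, replay_prob=0.5):
    buffer = LevelBuffer()
    for _ in range(num_iters):
        if buffer.scores and random.random() < replay_prob:
            # Replay branch: the ONLY place gradient updates happen.
            level = buffer.sample()
            rollout = agent.collect(level)
            agent.ppo_update(rollout)
            buffer.add(level, score_level(rollout))  # refresh its score
        else:
            # Exploration branch: score a brand-new random level for the
            # buffer, but take NO policy update on it (the PLR-perp twist).
            level = sample_random_level()
            rollout = agent.collect(level)
            buffer.add(level, score_level(rollout))
    return agent

train_plr_perp(DummyAgent())
```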
Related papers
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
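To make "regressing relative rewards" concrete: REBEL's per-iteration update can be sketched as a least-squares regression of reward differences onto differences of policy log-ratios (a paraphrase, not a quote from the paper; the exact notation and conditioning may differ):

$$\theta_{t+1} = \arg\min_{\theta}\ \mathbb{E}_{x,\,y,\,y'}\left[\left(\frac{1}{\eta}\left(\ln \frac{\pi_\theta(y \mid x)}{\pi_{\theta_t}(y \mid x)} - \ln \frac{\pi_\theta(y' \mid x)}{\pi_{\theta_t}(y' \mid x)}\right) - \big(r(x, y) - r(x, y')\big)\right)^2\right]$$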
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design [11.922951794283168]
In this work, we investigate how the sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents.
We discover that for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data.
We find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance.
To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED).
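The value-loss prioritisation referenced here is, in PLR-style methods, typically the average magnitude of the GAE advantage over an episode (the "L1 value loss"). A one-function sketch, with hypothetical names:

```python
def l1_value_loss_score(gae_advantages):
    """Mean absolute GAE over an episode: a proxy for how wrong the value
    function is on this level, used as the level's replay priority."""
    return sum(abs(a) for a in gae_advantages) / max(len(gae_advantages), 1)
```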
arXiv Detail & Related papers (2024-02-05T19:47:45Z)
- Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
- PEAR: Primitive enabled Adaptive Relabeling for boosting Hierarchical Reinforcement Learning [25.84621883831624]
Hierarchical reinforcement learning has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration.
We present primitive enabled adaptive relabeling (PEAR).
We first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision.
We then jointly optimize HRL agents by employing reinforcement learning (RL) and imitation learning (IL).
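Schematically, the joint optimization here combines an RL objective with an IL regularizer fit to the relabeled expert subgoals; the form below is a generic sketch, not PEAR's exact loss ($\lambda$ is a hypothetical trade-off weight):

$$\max_{\theta}\ J_{\mathrm{RL}}(\theta) - \lambda\, \mathcal{L}_{\mathrm{IL}}(\theta)$$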
arXiv Detail & Related papers (2023-06-10T09:41:30Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that first acquires exploratory trajectories enabling accurate learning of the hidden reward function.
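For background, pairwise preference feedback in PbRL is standardly modelled with a Bradley-Terry link from the hidden reward $r$ to preference probabilities (a textbook formulation, not necessarily this paper's exact model):

$$P(\tau^1 \succ \tau^2) = \frac{\exp\left(\sum_t r(s_t^1, a_t^1)\right)}{\exp\left(\sum_t r(s_t^1, a_t^1)\right) + \exp\left(\sum_t r(s_t^2, a_t^2)\right)}$$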
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- On Practical Robust Reinforcement Learning: Practical Uncertainty Set and Double-Agent Algorithm [11.748284119769039]
Robust reinforcement learning (RRL) aims to find a robust policy that optimizes worst-case performance over an uncertainty set of Markov decision processes (MDPs).
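The worst-case objective has the standard minimax form (shown generically; the paper's contribution is its particular uncertainty set $\mathcal{U}$ and algorithm):

$$\max_{\pi}\ \min_{M \in \mathcal{U}}\ \mathbb{E}_{M,\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$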
arXiv Detail & Related papers (2023-05-11T08:52:09Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Single-Trajectory Distributionally Robust Reinforcement Learning [13.013268095049236]
Reinforcement Learning (RL) has been regarded as an essential component on the path to Artificial General Intelligence (AGI).
However, RL is often criticized for assuming that the training environment is identical to the test environment, an assumption that hinders its application in the real world.
To mitigate this problem, Distributionally Robust RL (DRRL) has been proposed to improve worst-case performance over a set of environments that may contain the unknown test environment.
arXiv Detail & Related papers (2023-01-27T14:08:09Z)
- Grounding Aleatoric Uncertainty in Unsupervised Environment Design [32.00797965770773]
In partially-observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to curriculum-induced covariate shift (CICS).
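The minimax regret objective shared by PAIRED-style UED methods (and by the Dual Curriculum Design framing of the main paper above) is, with $\theta$ the free environment parameters and $U_\theta(\pi)$ the expected return of $\pi$ on level $\theta$:

$$\min_{\pi}\ \max_{\theta}\ \mathrm{Regret}(\pi, \theta) = \min_{\pi}\ \max_{\theta}\ \big(U_\theta(\pi^*_\theta) - U_\theta(\pi)\big)$$

where $\pi^*_\theta$ is an optimal policy for level $\theta$.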
arXiv Detail & Related papers (2022-07-11T22:45:29Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
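Schematically, likelihood-ratio DRO trains against an adversarially reweighted version of the training distribution $p$ (a generic form; here the ratio $r$ is parameterised by a model rather than solved in closed form):

$$\min_{\theta}\ \max_{r \in \mathcal{R}}\ \mathbb{E}_{x \sim p}\big[r(x)\,\ell(x;\theta)\big] \quad \text{s.t.}\ \ \mathbb{E}_{p}[r] = 1,\ r \ge 0$$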
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Towards Scaling Difference Target Propagation by Learning Backprop Targets [64.90165892557776]
Difference Target Propagation (DTP) is a biologically plausible learning algorithm closely related to Gauss-Newton (GN) optimization.
We propose a novel feedback weight training scheme that ensures both that DTP approximates backpropagation (BP) and that layer-wise feedback weight training can be restored.
We report the best performance ever achieved by DTP on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2022-01-31T18:20:43Z)