Grounding Aleatoric Uncertainty in Unsupervised Environment Design
- URL: http://arxiv.org/abs/2207.05219v1
- Date: Mon, 11 Jul 2022 22:45:29 GMT
- Title: Grounding Aleatoric Uncertainty in Unsupervised Environment Design
- Authors: Minqi Jiang, Michael Dennis, Jack Parker-Holder, Andrei Lupu, Heinrich Küttler, Edward Grefenstette, Tim Rocktäschel, Jakob Foerster
- Abstract summary: In partially-observable settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment.
We propose a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS.
- Score: 32.00797965770773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive curricula in reinforcement learning (RL) have proven effective for
producing policies robust to discrepancies between the train and test
environment. Recently, the Unsupervised Environment Design (UED) framework
generalized RL curricula to generating sequences of entire environments,
leading to new methods with robust minimax regret properties. Problematically,
in partially-observable or stochastic settings, optimal policies may depend on
the ground-truth distribution over aleatoric parameters of the environment in
the intended deployment setting, while curriculum learning necessarily shifts
the training distribution. We formalize this phenomenon as curriculum-induced
covariate shift (CICS), and describe how its occurrence in aleatoric parameters
can lead to suboptimal policies. Directly sampling these parameters from the
ground-truth distribution avoids the issue, but thwarts curriculum learning. We
propose SAMPLR, a minimax regret UED method that optimizes the ground-truth
utility function, even when the underlying training data is biased due to CICS.
We prove, and validate on challenging domains, that our approach preserves
optimality under the ground-truth distribution, while promoting robustness
across the full range of environment settings.
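To make curriculum-induced covariate shift concrete, the following is a minimal, hypothetical sketch in plain Python (not the authors' implementation of SAMPLR): a one-step task with a single aleatoric parameter theta, where a curriculum that over-samples favourable values of theta makes a risky action look optimal even though it is suboptimal under the ground-truth distribution. The correction shown, reweighting curriculum samples by the likelihood ratio between the ground-truth and curriculum distributions over theta, is a generic importance-weighting fix under these toy assumptions; the payoffs and distributions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Aleatoric parameter theta: probability that a "risky" action pays off.
# Hypothetical ground-truth (deployment) distribution vs. a curriculum-shifted one.
ground_truth = {0.2: 0.8, 0.9: 0.2}   # P(theta) at deployment
curriculum   = {0.2: 0.2, 0.9: 0.8}   # curriculum over-samples easy payoffs

def expected_return(action, theta):
    """One-step task: 'risky' pays 1 with probability theta, 'safe' pays 0.5."""
    return theta if action == "risky" else 0.5

def value_under(dist, action):
    """Expected return of an action when theta is drawn from `dist`."""
    return sum(p * expected_return(action, theta) for theta, p in dist.items())

# Under the curriculum the risky action looks best (0.76 vs. 0.5), but under
# the ground-truth distribution it is worse (0.34 vs. 0.5): a CICS-style failure.
for a in ("risky", "safe"):
    print(a, "curriculum:", value_under(curriculum, a),
          "ground truth:", value_under(ground_truth, a))

# Generic correction: reweight curriculum samples by the likelihood ratio
# P_ground(theta) / P_curriculum(theta) so value estimates stay grounded in
# the deployment distribution (SAMPLR itself uses a different mechanism).
def grounded_estimate(action, n=10_000):
    thetas = rng.choice(list(curriculum), p=list(curriculum.values()), size=n)
    weights = np.array([ground_truth[t] / curriculum[t] for t in thetas])
    returns = np.array([expected_return(action, t) for t in thetas])
    return float(np.mean(weights * returns))

for a in ("risky", "safe"):
    print(a, "importance-weighted estimate:", round(grounded_estimate(a), 3))
```

With these toy numbers, the reweighted estimates recover roughly 0.34 for the risky action and 0.5 for the safe one, so the grounded evaluation correctly prefers the safe action despite the biased curriculum.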
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
The resulting algorithm, ERPO, shows faster policy adaptation, higher average rewards, and reduced computational cost.
arXiv Detail & Related papers (2024-10-22T09:29:53Z)
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters.
We learn and analyse interval MDPs (IMDPs) for a set of unknown sample environments induced by parameters.
We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design [11.922951794283168]
In this work, we investigate how the sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents.
We discover that for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data.
We find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance.
To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED).
arXiv Detail & Related papers (2024-02-05T19:47:45Z)
- Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits [18.982448033389588]
Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment.
To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed.
We propose a novel DRO approach that employs the Wasserstein distance instead.
arXiv Detail & Related papers (2023-09-15T20:21:46Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification [22.241676350331968]
This study focuses on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values.
The aim is to optimize the worst-case performance on the uncertainty parameter set to guarantee the performance in the corresponding real-world environment.
Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method achieves worst-case performance superior to several baseline approaches.
arXiv Detail & Related papers (2022-11-07T10:18:31Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interaction with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Distributionally Robust Policy Learning via Adversarial Environment Generation [3.42658286826597]
We propose DRAGEN - Distributionally Robust policy learning via Adversarial Generation of ENvironments.
We learn a generative model for environments whose latent variables capture cost-predictive and realistic variations in environments.
We demonstrate strong Out-of-Distribution (OoD) generalization in simulation for grasping realistic 2D/3D objects.
arXiv Detail & Related papers (2021-07-13T19:26:34Z)
- Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design [121.73425076217471]
We propose Unsupervised Environment Design (UED), where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments.
We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED).
Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
arXiv Detail & Related papers (2020-12-03T17:37:01Z)
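For context on the regret signal that drives PAIRED (and minimax regret UED methods more broadly), here is a minimal sketch of the regret estimate commonly described for PAIRED: the environment adversary is rewarded with the antagonist's best return minus the protagonist's mean return on the same generated level. The function name and aggregation details are illustrative assumptions rather than the exact code from that paper.

```python
from typing import Sequence

def paired_regret(protagonist_returns: Sequence[float],
                  antagonist_returns: Sequence[float]) -> float:
    """Approximate regret on one generated level: the antagonist's best
    observed return minus the protagonist's mean return. Rewarding the
    environment adversary with this quantity steers it toward levels that
    are solvable (the antagonist succeeds) but not yet mastered by the
    protagonist."""
    return max(antagonist_returns) - sum(protagonist_returns) / len(protagonist_returns)

# Example: a level the protagonist has not mastered yields higher regret
# than one it already solves, so the adversary keeps proposing the former.
print(paired_regret([0.2, 0.4], antagonist_returns=[0.8, 0.9]))  # ≈ 0.6
print(paired_regret([0.7, 0.9], antagonist_returns=[0.9]))       # ≈ 0.1
```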
This list is automatically generated from the titles and abstracts of the papers on this site.