Self-Supervised Policy Adaptation during Deployment
- URL: http://arxiv.org/abs/2007.04309v3
- Date: Fri, 9 Apr 2021 02:47:39 GMT
- Title: Self-Supervised Policy Adaptation during Deployment
- Authors: Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter
Abbeel, Alexei A. Efros, Lerrel Pinto, Xiaolong Wang
- Abstract summary: Self-supervision allows the policy to continue training after deployment without using any rewards.
Empirical evaluations are performed on diverse simulation environments from the DeepMind Control Suite and ViZDoom.
Our method improves generalization in 31 out of 36 environments across various tasks and outperforms domain randomization on a majority of environments.
- Score: 98.25486842109936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In most real-world scenarios, a policy trained by reinforcement learning in
one environment needs to be deployed in another, potentially quite different
environment. However, generalization across different environments is known to
be hard. A natural solution would be to keep training after deployment in the
new environment, but this cannot be done if the new environment offers no
reward signal. Our work explores the use of self-supervision to allow the
policy to continue training after deployment without using any rewards. While
previous methods explicitly anticipate changes in the new environment, we
assume no prior knowledge of those changes yet still obtain significant
improvements. Empirical evaluations are performed on diverse simulation
environments from the DeepMind Control Suite and ViZDoom, as well as real robotic
manipulation tasks in continuously changing environments, taking observations
from an uncalibrated camera. Our method improves generalization in 31 out of 36
environments across various tasks and outperforms domain randomization on a
majority of environments.
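Concretely, the idea can be pictured as a small test-time training loop: a shared encoder feeds both the policy head and a self-supervised head (the paper uses, e.g., an inverse dynamics objective for control), and during deployment only the self-supervised loss is backpropagated, updating the encoder from reward-free interaction. Below is a minimal PyTorch sketch of this pattern; the architecture sizes, the gym-style tensor-valued `env`, and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """Shared encoder feeding a policy head and a self-supervised head."""
    def __init__(self, obs_dim=64, act_dim=6, feat_dim=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.policy = nn.Linear(feat_dim, act_dim)   # not updated at deployment
        self.inv_dyn = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, act_dim))

    def act(self, obs):
        return self.policy(self.encoder(obs))

def deploy_with_adaptation(agent, env, steps=1000, lr=1e-3):
    """Act in the new environment while updating the encoder with a
    reward-free inverse-dynamics loss on each observed transition."""
    opt = torch.optim.Adam(list(agent.encoder.parameters()) +
                           list(agent.inv_dyn.parameters()), lr=lr)
    obs = env.reset()
    for _ in range(steps):
        action = agent.act(obs).detach()
        next_obs, _, done, _ = env.step(action)      # reward is ignored
        feats = torch.cat([agent.encoder(obs), agent.encoder(next_obs)], dim=-1)
        loss = ((agent.inv_dyn(feats) - action) ** 2).mean()  # predict own action
        opt.zero_grad(); loss.backward(); opt.step()
        obs = env.reset() if done else next_obs
```

Only the encoder and the self-supervised head receive gradients here; the policy head stays as trained, so adaptation changes the representation beneath the controller rather than the controller itself.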
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm, ERPO, inspired by evolutionary game theory (EGT).
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs.
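Evolutionary game theory typically enters through replicator-style updates: maintain a population of policies and shift selection probability toward members whose fitness in the shifted environment beats the population average. The discrete replicator step below is a generic EGT sketch under that assumption, not ERPO's actual update.

```python
import numpy as np

def replicator_step(weights, fitness, lr=0.5):
    """Discrete-time replicator dynamics over a policy population.

    `weights` are selection probabilities of the population members,
    `fitness` their measured returns in the shifted environment.
    """
    advantage = fitness - weights @ fitness        # fitness vs. population mean
    weights = weights * np.exp(lr * advantage)     # reweight toward the fitter
    return weights / weights.sum()

w = np.array([0.25, 0.25, 0.25, 0.25])
f = np.array([1.0, 0.5, 0.2, 0.1])
print(replicator_step(w, f))  # mass shifts toward the fittest policy
```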
arXiv Detail & Related papers (2024-10-22T09:29:53Z)
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
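As a rough illustration, one natural formalisation of learnability scores a level by p(1 - p), where p is the agent's current success rate on it: levels that are always or never solved score zero, while levels at the edge of competence score highest. The sampler below is a hedged sketch of training on high-learnability scenarios; the function names and proportional-sampling rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

def learnability(success_rate: np.ndarray) -> np.ndarray:
    """p * (1 - p): highest for levels solved about half the time."""
    return success_rate * (1.0 - success_rate)

def sample_levels(success_rate: np.ndarray, n: int, rng=np.random) -> np.ndarray:
    """Sample training levels in proportion to their learnability score."""
    scores = learnability(success_rate)
    if scores.sum() == 0.0:                  # degenerate case: all 0% or 100%
        probs = np.full(len(scores), 1.0 / len(scores))
    else:
        probs = scores / scores.sum()
    return rng.choice(len(scores), size=n, p=probs)

# Example: levels solved 0%, 50%, 90%, 100% of the time.
rates = np.array([0.0, 0.5, 0.9, 1.0])
print(learnability(rates))        # [0.   0.25 0.09 0.  ]
print(sample_levels(rates, n=5))  # mostly picks the 50%-solved level
```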
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- EvIL: Evolution Strategies for Generalisable Imitation Learning [33.745657379141676]
In imitation learning (IL), the environment in which expert demonstrations were collected and the environment in which we want to deploy the learned policy are rarely exactly the same.
Compared to policy-centric approaches to IL such as behavioural cloning, reward-centric approaches such as inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments.
We find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in.
We propose EvIL, a novel evolution-strategies-based method that optimises a reward-shaping term to speed up re-training in the target environment.
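A generic evolution-strategies loop matching this description might perturb the parameters of a reward-shaping term, score each perturbation by how quickly a policy re-trains under the shaped reward, and move the mean toward the better perturbations. The sketch below is vanilla ES under that assumed fitness function; it is not EvIL itself.

```python
import numpy as np

def es_optimise_shaping(fitness, dim, iters=100, pop=32, sigma=0.1, lr=0.02,
                        rng=None):
    """Vanilla evolution strategies over reward-shaping parameters.

    `fitness(theta)` is assumed to return a scalar such as the return a
    policy reaches after a fixed re-training budget under shaping `theta`.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(dim)
    for _ in range(iters):
        eps = rng.standard_normal((pop, dim))          # population of perturbations
        scores = np.array([fitness(theta + sigma * e) for e in eps])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalise
        theta += lr / (pop * sigma) * eps.T @ scores   # ES gradient estimate
    return theta
```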
arXiv Detail & Related papers (2024-06-15T22:46:39Z)
- A Behavior-Aware Approach for Deep Reinforcement Learning in Non-stationary Environments without Known Change Points [30.077746056549678]
This research introduces Behavior-Aware Detection and Adaptation (BADA), an innovative framework that merges environmental change detection with behavior adaptation.
The key inspiration behind our method is that policies exhibit different global behaviors in changing environments.
The results of a series of experiments demonstrate better performance relative to several current algorithms.
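One way to operationalise the "different global behaviours" intuition is to flag a change point when the policy's recent action distribution diverges from a reference window. The sketch below does this with a symmetric KL divergence over discretised action histograms; the divergence choice, threshold, and names are purely assumptions, not the BADA algorithm.

```python
import numpy as np

def action_histogram(actions, bins=10, low=-1.0, high=1.0):
    """Smoothed, normalised histogram of recent (1-D) actions."""
    counts, _ = np.histogram(actions, bins=bins, range=(low, high))
    counts = counts + 1e-6                     # avoid log(0) in the divergence
    return counts / counts.sum()

def behaviour_shift(reference_actions, recent_actions, threshold=0.5):
    """Flag an environment change when recent behaviour diverges from the
    reference window (symmetric KL between action histograms)."""
    p = action_histogram(reference_actions)
    q = action_histogram(recent_actions)
    sym_kl = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return sym_kl > threshold
```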
arXiv Detail & Related papers (2024-05-23T06:17:26Z)
- Improving adaptability to new environments and removing catastrophic forgetting in Reinforcement Learning by using an eco-system of agents [3.5786621294068373]
Adapting a Reinforcement Learning (RL) agent to an unseen environment is a difficult task due to typical over-fitting on the training environment.
There is a risk of catastrophic forgetting, where the performance on previously seen environments is seriously hampered.
This paper proposes a novel approach that exploits an ecosystem of agents to address both concerns.
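A hedged sketch of the ecosystem idea: keep a pool of frozen specialists, reuse the best one when it transfers to a new environment, and only train and add a fresh agent when none is good enough, so earlier agents are never overwritten. The interface and acceptance threshold below are assumptions.

```python
def handle_new_env(pool, env, evaluate, train_new_agent, good_enough=0.9):
    """Reuse a frozen specialist if one transfers; otherwise grow the pool.

    `pool`            list of previously trained (frozen) agents
    `evaluate`        callable (agent, env) -> score in [0, 1]
    `train_new_agent` callable (env) -> freshly trained agent
    """
    scored = [(evaluate(agent, env), agent) for agent in pool]
    best_score, best_agent = max(scored, default=(float("-inf"), None),
                                 key=lambda t: t[0])
    if best_score >= good_enough:
        return best_agent                 # transfer: no training, no forgetting
    new_agent = train_new_agent(env)      # specialise on the new environment
    pool.append(new_agent)                # old agents stay frozen
    return new_agent
```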
arXiv Detail & Related papers (2022-04-13T17:52:54Z)
- Continual Predictive Learning from Videos [100.27176974654559]
We study a new continual learning problem in the context of video prediction.
We propose the continual predictive learning (CPL) approach, which learns a mixture world model via predictive experience replay.
We construct two new benchmarks based on RoboNet and KTH, in which different tasks correspond to different physical robotic environments or human actions.
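"Predictive experience replay" can be read as a form of generative replay: while fitting the world model on the current task's real video, also train it on sequences it generates for earlier tasks so their dynamics are not overwritten. The sketch below assumes hypothetical `model.train_step` and `model.rollout` methods and illustrates that reading only, not the CPL implementation.

```python
import random

def continual_world_model_update(model, current_batches, past_task_ids,
                                 replay_ratio=0.5, horizon=10):
    """Fit the world model on real current-task video while rehearsing
    earlier tasks with sequences the model generates itself."""
    for batch in current_batches:
        model.train_step(batch)                       # real frames, current task
        if past_task_ids and random.random() < replay_ratio:
            task = random.choice(past_task_ids)
            generated = model.rollout(task_id=task, horizon=horizon)
            model.train_step(generated)               # replayed prediction
    return model
```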
arXiv Detail & Related papers (2022-04-12T08:32:26Z)
- EnvEdit: Environment Editing for Vision-and-Language Navigation [98.30038910061894]
In Vision-and-Language Navigation (VLN), an agent needs to navigate through the environment based on natural language instructions.
We propose EnvEdit, a data augmentation method that creates new environments by editing existing environments.
We show that our proposed EnvEdit method gets significant improvements in all metrics on both pre-trained and non-pre-trained VLN agents.
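In spirit, the augmentation generates edited copies of each training scene (changed style or appearance) and trains the agent on originals and edits alike. The snippet below stands in for the paper's learned edits with simple Pillow colour and brightness jitter, purely as an assumed placeholder.

```python
import random
from PIL import Image, ImageEnhance

def edit_environment(view: Image.Image, rng=random) -> Image.Image:
    """Create an appearance-edited copy of one environment view."""
    out = ImageEnhance.Color(view).enhance(rng.uniform(0.5, 1.5))      # saturation
    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.7, 1.3))  # lighting
    return out

def augment_scene(views):
    """Return the original views plus one edited variant of each."""
    return list(views) + [edit_environment(v) for v in views]
```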
arXiv Detail & Related papers (2022-03-29T15:44:32Z)
- One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL [142.36621929739707]
We show that learning diverse behaviors for accomplishing a task can lead to behavior that generalizes to varying environments.
By identifying multiple solutions for the task in a single environment during training, our approach can generalize to new situations.
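The few-shot extrapolation step can be sketched as: given a policy conditioned on a discrete latent indexing one of the diverse solutions, roll each latent out a few times in the perturbed environment and keep whichever solution still works. The gym-style loop and `policy(obs, z)` signature below are assumptions.

```python
import numpy as np

def pick_best_solution(policy, env, n_latents, episodes=3):
    """Few-shot selection among diverse pre-trained solutions.

    `policy(obs, z)` is assumed to map an observation and a discrete
    latent z (one behaviour/solution) to an action.
    """
    def episode_return(z):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done, _ = env.step(policy(obs, z))
            total += reward
        return total

    returns = [np.mean([episode_return(z) for _ in range(episodes)])
               for z in range(n_latents)]
    return int(np.argmax(returns))   # latent of the best-generalising solution
```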
arXiv Detail & Related papers (2020-10-27T17:41:57Z)
- Fast Adaptation via Policy-Dynamics Value Functions [41.738462615120326]
We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains.
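A minimal rendering of that estimate: a network V(z_policy, z_env) scores policy embeddings against an environment embedding inferred from a few transitions, and adaptation becomes an argmax over candidate policy embeddings rather than gradient steps. Sizes and names in the PyTorch sketch below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PDValueFunction(nn.Module):
    """V(z_policy, z_env): predicted return of a policy embedding under an
    environment embedding (both assumed pre-trained)."""
    def __init__(self, policy_dim=8, env_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(policy_dim + env_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, z_policy, z_env):
        return self.net(torch.cat([z_policy, z_env], dim=-1)).squeeze(-1)

def fast_adapt(vf, z_env, candidate_policies):
    """Pick the policy embedding the value function scores highest for the
    inferred dynamics embedding of the new environment."""
    with torch.no_grad():
        scores = vf(candidate_policies, z_env.expand(len(candidate_policies), -1))
    return candidate_policies[scores.argmax()]
```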
arXiv Detail & Related papers (2020-07-06T16:47:56Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
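One way to picture the latent-variable machinery: an inference network summarises a window of recent transitions into a latent describing the current (unobserved) environment conditions, and the policy conditions on that latent. The GRU-based sketch below is a hedged illustration of this pattern, not the paper's model.

```python
import torch
import torch.nn as nn

class ShiftInference(nn.Module):
    """Summarise recent transitions (s, a, r, s') into a latent describing
    the current environment conditions."""
    def __init__(self, sas_dim, latent_dim=8):
        super().__init__()
        self.rnn = nn.GRU(sas_dim, 32, batch_first=True)
        self.head = nn.Linear(32, latent_dim)

    def forward(self, transitions):        # (batch, window, sas_dim)
        _, h = self.rnn(transitions)
        return self.head(h[-1])            # (batch, latent_dim)

# The policy then takes [state, z] so behaviour can track the shift:
# action = policy(torch.cat([state, shift_inference(recent_window)], dim=-1))
```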
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.