Learning a subspace of policies for online adaptation in Reinforcement Learning
- URL: http://arxiv.org/abs/2110.05169v1
- Date: Mon, 11 Oct 2021 11:43:34 GMT
- Title: Learning a subspace of policies for online adaptation in Reinforcement Learning
- Authors: Jean-Baptiste Gaya, Laure Soulier, Ludovic Denoyer
- Abstract summary: In control systems, the robot on which a policy is learned might differ from the robot on which a policy will run.
There is a need to develop RL methods that generalize well to variations of the training conditions.
In this article, we consider the simplest yet hard-to-tackle generalization setting where the test environment is unknown at train time.
- Score: 14.7945053644125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Reinforcement Learning (RL) is mainly studied in a setting where the
training and the testing environments are similar. But in many practical
applications, these environments may differ. For instance, in control systems,
the robot(s) on which a policy is learned might differ from the robot(s) on
which a policy will run. This can be caused by internal factors (e.g.,
calibration issues, system attrition, defective modules) or by external
changes (e.g., weather conditions). There is a need to develop RL methods that
generalize well to variations of the training conditions. In this article, we
consider the simplest yet hard-to-tackle generalization setting where the test
environment is unknown at train time, forcing the agent to adapt to the
system's new dynamics. This online adaptation process can be computationally
expensive (e.g., fine-tuning) and cannot rely on meta-RL techniques since there
is only a single training environment. To address this, we propose an approach where we
learn a subspace of policies within the parameter space. This subspace contains
an infinite number of policies that are trained to solve the training
environment while having different parameter values. As a consequence, two
policies in that subspace process information differently and exhibit different
behaviors when facing variations of the train environment. Our experiments
carried out over a large variety of benchmarks compare our approach with
baselines, including diversity-based methods. In comparison, our approach is
simple to tune, does not need any extra component (e.g., a discriminator), and
learns policies that gather high reward in unseen environments.
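To make the construction concrete, here is a minimal, hedged sketch of a policy subspace: two anchor parameter sets define a line of policies, a fresh interpolation coefficient alpha is drawn for each rollout during training, and online adaptation at test time reduces to a one-dimensional search over alpha. The class, layer sizes, and training-loop details below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicySubspace(nn.Module):
    """Line segment of policies: theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_1.

    Illustrative sketch only: two anchor parameter sets span an infinite
    family of policies, one for each alpha in [0, 1].
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, act_dim),
            )
        self.anchor0 = mlp()  # theta_0
        self.anchor1 = mlp()  # theta_1

    def forward(self, obs: torch.Tensor, alpha: float) -> torch.Tensor:
        # Evaluate the interpolated policy theta(alpha) functionally so that
        # gradients reach both anchors, whatever alpha was sampled.
        out = obs
        for layer0, layer1 in zip(self.anchor0, self.anchor1):
            if isinstance(layer0, nn.Linear):
                weight = (1 - alpha) * layer0.weight + alpha * layer1.weight
                bias = (1 - alpha) * layer0.bias + alpha * layer1.bias
                out = F.linear(out, weight, bias)
            else:
                out = layer0(out)  # parameter-free activation
        return out

# Training sketch: draw a fresh alpha per rollout so the whole segment is
# trained to solve the (single) training environment.
policy = PolicySubspace(obs_dim=8, act_dim=2)
alpha = torch.rand(()).item()
logits = policy(torch.randn(1, 8), alpha)

# Test-time adaptation on unknown dynamics then reduces to a cheap 1-D
# search: roll out a few values of alpha and keep the best-returning one.
```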
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods, such as IQL, CQL, and BRAC, improve on benchmarks when combined with the proposed discretization scheme (see the quantization sketch after this list).
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL).
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can easily be combined with most offline RL algorithms, yielding statistically significant performance improvements (see the weighting sketch after this list).
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
- Adaptive Tracking of a Single-Rigid-Body Character in Various Environments [2.048226951354646]
We propose a deep reinforcement learning method based on the simulation of a single-rigid-body character.
Using the centroidal dynamics model (CDM) to express the full-body character as a single rigid body (SRB) and training a policy to track a reference motion, we can obtain a policy capable of adapting to various unobserved environmental changes.
We demonstrate that our policy, efficiently trained within 30 minutes on an ultraportable laptop, has the ability to cope with environments that have not been experienced during learning.
arXiv Detail & Related papers (2023-08-14T22:58:54Z)
- Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition [63.67574523750839]
We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
arXiv Detail & Related papers (2023-02-02T16:00:19Z)
- Self-Supervised Policy Adaptation during Deployment [98.25486842109936]
Self-supervision allows the policy to continue training after deployment without using any rewards.
Empirical evaluations are performed on diverse simulation environments from the DeepMind Control suite and ViZDoom.
Our method improves generalization in 31 out of 36 environments across various tasks and outperforms domain randomization on a majority of environments (see the test-time adaptation sketch after this list).
arXiv Detail & Related papers (2020-07-08T17:56:27Z)
- Fast Adaptation via Policy-Dynamics Value Functions [41.738462615120326]
We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains (see the policy-environment value sketch after this list).
arXiv Detail & Related papers (2020-07-06T16:47:56Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
- Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization [100.72335252255989]
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task (see the regularization sketch after this list).
arXiv Detail & Related papers (2020-05-06T16:14:48Z)
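The sketches referenced in the list above follow. First, for the Action-Quantized Offline Reinforcement Learning entry, a minimal sketch of the plain (non-adaptive) version of the idea: cluster the continuous actions of an offline dataset into a small codebook so that discrete-action offline RL machinery applies. The adaptive scheme proposed in that paper is more elaborate; the dataset, cluster count, and library choice here are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Offline dataset of continuous actions (dummy data standing in for, e.g., robot commands).
actions = np.random.uniform(-1.0, 1.0, size=(10_000, 4))

# Build a small action codebook; discrete-action offline RL methods then
# operate on codebook indices instead of raw continuous actions.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(actions)
codebook = kmeans.cluster_centers_            # index -> continuous action prototype
discrete_actions = kmeans.predict(actions)    # continuous action -> index

# Executing a discrete choice means looking up its continuous prototype.
chosen_index = 3
executed_action = codebook[chosen_index]
```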
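For the Hundreds Guide Millions (GORL) entry, a hedged sketch of the per-sample trade-off it describes: a small guiding network outputs a weight for each transition that balances a policy-improvement term against a behavior-constraint term. The network, shapes, and loss decomposition are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn as nn

# Guiding network: maps (state, action) to a weight in (0, 1) -- assumed shapes.
guide = nn.Sequential(nn.Linear(8 + 2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

def gorl_style_loss(improvement_loss, constraint_loss, state, action):
    """Per-sample weighted sum of a policy-improvement loss and a behavior constraint."""
    w = guide(torch.cat([state, action], dim=-1)).squeeze(-1)  # one weight per transition
    return (w * improvement_loss + (1.0 - w) * constraint_loss).mean()

# Dummy per-sample loss terms; in practice they would come from a critic and
# a behavior-cloning / divergence penalty, respectively.
state, action = torch.randn(256, 8), torch.randn(256, 2)
improvement_loss, constraint_loss = torch.rand(256), torch.rand(256)
loss = gorl_style_loss(improvement_loss, constraint_loss, state, action)
loss.backward()  # gradients also reach the guiding network
```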
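For the Self-Supervised Policy Adaptation during Deployment entry, a hedged sketch of reward-free test-time adaptation: a shared encoder keeps training on an auxiliary self-supervised objective (here, inverse-dynamics prediction) while the policy head keeps acting. Module names, shapes, and the choice of auxiliary task are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared encoder feeds both the policy head and a self-supervised head
# (inverse dynamics: predict the action taken from two consecutive states).
encoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
policy_head = nn.Linear(64, 2)
inv_dyn_head = nn.Linear(128, 2)  # concatenation of two encodings -> action

opt = torch.optim.Adam(list(encoder.parameters()) + list(inv_dyn_head.parameters()), lr=1e-4)

def adapt_step(s, s_next, a_taken):
    """One reward-free gradient step during deployment (hypothetical shapes)."""
    z = torch.cat([encoder(s), encoder(s_next)], dim=-1)
    loss = F.mse_loss(inv_dyn_head(z), a_taken)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Deployment loop (dummy data): the policy keeps acting while the encoder
# is updated with the self-supervised loss only -- no rewards needed.
s, s_next, a_taken = torch.randn(1, 8), torch.randn(1, 8), torch.randn(1, 2)
adapt_step(s, s_next, a_taken)
action = policy_head(encoder(s_next))
```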
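For the Policy-Dynamics Value Functions entry, a hedged sketch of the core object: a value model that scores a pair of policy and environment embeddings, so that adapting to new dynamics becomes a search over candidate policy embeddings rather than gradient fine-tuning. All names and dimensions are assumed for illustration.

```python
import torch
import torch.nn as nn

# V(z_pi, z_env): predicted cumulative reward of policy embedding z_pi in an
# environment summarized by embedding z_env (sketch, assumed dimensions).
value_net = nn.Sequential(nn.Linear(16 + 16, 64), nn.ReLU(), nn.Linear(64, 1))

def best_policy_embedding(z_env: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Pick the candidate policy embedding with the highest predicted value."""
    inputs = torch.cat([candidates, z_env.expand(candidates.size(0), -1)], dim=-1)
    scores = value_net(inputs).squeeze(-1)
    return candidates[scores.argmax()]

# At test time: infer z_env from a few interactions (not shown), then choose
# among candidate policy embeddings without further gradient updates.
z_env = torch.randn(1, 16)
candidates = torch.randn(32, 16)
z_star = best_policy_embedding(z_env, candidates)
```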
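Finally, for the Informed Policy Regularization entry, a hedged sketch of the regularizer it mentions: the RL objective is augmented with a KL term that keeps the RNN-based policy close to a task-informed policy. The coefficient, KL direction, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def regularized_policy_loss(rl_loss, rnn_logits, informed_logits, beta=0.1):
    """RL loss plus a KL term pulling the RNN policy toward the task-informed policy."""
    kl = F.kl_div(
        F.log_softmax(rnn_logits, dim=-1),      # learned (RNN) policy, log-probs
        F.softmax(informed_logits, dim=-1),     # informed per-task policy, probs
        reduction="batchmean",
    )
    return rl_loss + beta * kl

# Dummy usage: batch of 32 states, 4 discrete actions.
rl_loss = torch.tensor(1.0)
rnn_logits, informed_logits = torch.randn(32, 4), torch.randn(32, 4)
total = regularized_policy_loss(rl_loss, rnn_logits, informed_logits)
```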