PAnDR: Fast Adaptation to New Environments from Offline Experiences via
Decoupling Policy and Environment Representations
- URL: http://arxiv.org/abs/2204.02877v1
- Date: Wed, 6 Apr 2022 14:47:35 GMT
- Title: PAnDR: Fast Adaptation to New Environments from Offline Experiences via
Decoupling Policy and Environment Representations
- Authors: Tong Sang, Hongyao Tang, Yi Ma, Jianye Hao, Yan Zheng, Zhaopeng Meng,
Boyan Li, Zhen Wang
- Abstract summary: We propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
In the offline training phase, the environment representation and the policy representation are learned through contrastive learning and policy recovery, respectively.
In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent.
- Score: 39.11141327059819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Reinforcement Learning (DRL) has been a promising solution to many
complex decision-making problems. Nevertheless, the notorious weakness in
generalization across environments prevents the widespread application of DRL agents
in real-world scenarios. Although advances have been made recently, most prior
works assume sufficient online interaction on training environments, which can
be costly in practical cases. To this end, we focus on an
\textit{offline-training-online-adaptation} setting, in which the agent first
learns from offline experiences collected in environments with different
dynamics and then performs online policy adaptation in environments with new
dynamics. In this paper, we propose Policy Adaptation with Decoupled
Representations (PAnDR) for fast policy adaptation. In the offline training phase,
the environment representation and policy representation are learned through
contrastive learning and policy recovery, respectively. The representations are
further refined by mutual information optimization to make them more decoupled
and complete. With learned representations, a Policy-Dynamics Value Function
(PDVF) (Raileanu et al., 2020) network is trained to approximate the values for
different combinations of policies and environments. In the online adaptation
phase, with the environment context inferred from a few experiences collected in
the new environment, the policy is optimized by gradient ascent with respect to
the PDVF. Our experiments show that PAnDR outperforms existing algorithms in
several representative policy adaptation problems.
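To make the two-phase recipe concrete, below is a minimal PyTorch sketch of the online adaptation step only. The module names (EnvEncoder, PDVF), network sizes, and toy dimensions are hypothetical stand-ins, and the offline contrastive and mutual-information training of the representations is omitted; this illustrates the gradient-ascent idea described in the abstract, not the authors' implementation.

```python
# Minimal sketch of a PDVF-style online adaptation step. EnvEncoder, PDVF and
# the toy dimensions are hypothetical stand-ins; the offline contrastive /
# mutual-information training of the encoders is omitted.
import torch
import torch.nn as nn

class EnvEncoder(nn.Module):
    """Aggregates a few (s, a, s') transitions into an environment embedding."""
    def __init__(self, obs_dim, act_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, transitions):               # transitions: (N, 2*obs_dim + act_dim)
        return self.net(transitions).mean(dim=0)  # average over the few samples

class PDVF(nn.Module):
    """Predicts the value of a (policy embedding, environment embedding) pair."""
    def __init__(self, emb_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z_pi, z_env):
        return self.net(torch.cat([z_pi, z_env], dim=-1))

def adapt_policy_embedding(pdvf, z_env, emb_dim, steps=50, lr=0.1):
    """Online adaptation: gradient ascent on the policy embedding w.r.t. the PDVF."""
    z_pi = torch.zeros(emb_dim, requires_grad=True)
    opt = torch.optim.Adam([z_pi], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -pdvf(z_pi, z_env).squeeze()       # ascend the predicted value
        loss.backward()
        opt.step()
    return z_pi.detach()                          # decode into an executable policy afterwards

# Usage with toy sizes: infer the new environment's embedding from a handful of
# transitions, then optimise the policy embedding against the frozen PDVF.
obs_dim, act_dim, emb_dim = 8, 2, 16
encoder, pdvf = EnvEncoder(obs_dim, act_dim, emb_dim), PDVF(emb_dim)
few_transitions = torch.randn(10, 2 * obs_dim + act_dim)
z_env = encoder(few_transitions).detach()
z_pi_new = adapt_policy_embedding(pdvf, z_env, emb_dim)
```

In the paper the policy embedding is produced and decoded by networks trained offline; here it is simply optimised from zero to keep the sketch self-contained.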
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
ERPO shows faster adaptation, higher average rewards, and reduced computational costs on policy adaptation tasks.
arXiv Detail & Related papers (2024-10-22T09:29:53Z)
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with
Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL)
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
- Dynamics-Adaptive Continual Reinforcement Learning via Progressive
Contextualization [29.61829620717385]
A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the RL agent's behavior as the environment changes over its lifetime.
DaCoRL learns a context-conditioned policy using progressive contextualization.
DaCoRL consistently outperforms existing methods in terms of stability, overall performance, and generalization ability.
arXiv Detail & Related papers (2022-09-01T10:26:58Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves average performance over the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety
Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this RL setting that takes into account the user's preferences for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Fast Adaptation via Policy-Dynamics Value Functions [41.738462615120326]
We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains.
arXiv Detail & Related papers (2020-07-06T16:47:56Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance (see the sketch after this list).
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
- Provably Efficient Model-based Policy Adaptation [22.752774605277555]
A promising approach is to quickly adapt pre-trained policies to new environments.
Existing methods for this policy adaptation problem typically rely on domain randomization and meta-learning.
We propose new model-based mechanisms that enable online adaptation in unseen target environments.
arXiv Detail & Related papers (2020-06-14T23:16:20Z)
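As a concrete illustration of the online deployment step described in the Non-Stationary Off-Policy Optimization entry above, the sketch below switches between pre-trained sub-policies using a sliding window of their recent rewards. The sliding-window epsilon-greedy selector and the toy reward model are assumptions for illustration, not that paper's algorithm.

```python
# Illustrative online switching between pre-trained sub-policies, scored by a
# sliding window of their recent rewards. The window/epsilon scheme and the toy
# reward model are assumptions for illustration, not the paper's algorithm.
import random
from collections import deque

class SubPolicySwitcher:
    """Chooses among sub-policies using their recent average reward."""
    def __init__(self, num_policies, window=50, epsilon=0.1):
        self.recent = [deque(maxlen=window) for _ in range(num_policies)]
        self.epsilon = epsilon

    def select(self):
        # Occasionally explore; otherwise pick the best recent performer.
        if random.random() < self.epsilon:
            return random.randrange(len(self.recent))
        means = [sum(r) / len(r) if r else float("inf")  # untried policies first
                 for r in self.recent]
        return max(range(len(means)), key=means.__getitem__)

    def update(self, policy_idx, reward):
        self.recent[policy_idx].append(reward)

# Toy deployment: three sub-policies stand in for policies learned offline on
# different latent states; the best one changes halfway through (a regime shift).
switcher = SubPolicySwitcher(num_policies=3)
choices = []
for step in range(400):
    idx = switcher.select()
    best = 0 if step < 200 else 2
    reward = random.gauss(1.0 if idx == best else 0.2, 0.1)
    switcher.update(idx, reward)
    choices.append(idx)
print("choices in the last 100 steps:", {i: choices[-100:].count(i) for i in range(3)})
```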