Fast Adaptation via Policy-Dynamics Value Functions
- URL: http://arxiv.org/abs/2007.02879v1
- Date: Mon, 6 Jul 2020 16:47:56 GMT
- Title: Fast Adaptation via Policy-Dynamics Value Functions
- Authors: Roberta Raileanu, Max Goldstein, Arthur Szlam, Rob Fergus
- Abstract summary: We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains.
- Score: 41.738462615120326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard RL algorithms assume fixed environment dynamics and require a
significant amount of interaction to adapt to new environments. We introduce
Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting
to dynamics different from those previously seen in training. PD-VF explicitly
estimates the cumulative reward in a space of policies and environments. An
ensemble of conventional RL policies is used to gather experience on training
environments, from which embeddings of both policies and environments can be
learned. Then, a value function conditioned on both embeddings is trained. At
test time, a few actions are sufficient to infer the environment embedding,
enabling a policy to be selected by maximizing the learned value function
(which requires no additional environment interaction). We show that our method
can rapidly adapt to new dynamics on a set of MuJoCo domains. Code available at
https://github.com/rraileanu/policy-dynamics-value-functions.
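The test-time procedure described in the abstract can be sketched compactly. The snippet below is a minimal illustration rather than the released implementation: the value-network architecture, the embedding sizes, the hypothetical environment encoder, and selecting from a discrete candidate set of policy embeddings (instead of optimizing over the full policy space) are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class PDValueFunction(nn.Module):
    """Predicts cumulative reward for a (policy embedding, environment
    embedding) pair. The MLP architecture here is a placeholder."""
    def __init__(self, policy_dim=8, env_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(policy_dim + env_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_policy, z_env):
        return self.net(torch.cat([z_policy, z_env], dim=-1)).squeeze(-1)

def select_policy(value_fn, policy_embeddings, z_env):
    """Pick the candidate policy whose embedding maximizes V(., z_env);
    this step needs no further environment interaction."""
    with torch.no_grad():
        scores = value_fn(policy_embeddings, z_env.expand(len(policy_embeddings), -1))
    return int(torch.argmax(scores))

if __name__ == "__main__":
    value_fn = PDValueFunction()              # would be trained on ensemble experience
    policy_embeddings = torch.randn(5, 8)     # stand-ins for learned policy embeddings
    env_encoder = nn.Linear(12, 8)            # hypothetical encoder over early transitions
    first_transitions = torch.randn(1, 12)    # a few flattened test-time transitions
    z_env = env_encoder(first_transitions)    # infer the environment embedding
    print("selected policy index:", select_policy(value_fn, policy_embeddings, z_env))
```

In the method itself, the policy and environment embeddings come from encoders trained on trajectories collected by the policy ensemble; random tensors stand in for them here.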
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs.
arXiv Detail & Related papers (2024-10-22T09:29:53Z)
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations [39.11141327059819]
We propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
In offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery.
In online adaptation phase, with the environment context inferred from few experiences collected in new environments, the policy is optimized by gradient ascent.
arXiv Detail & Related papers (2022-04-06T14:47:35Z)
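PAnDR's online phase, as summarized in the entry above, amounts to gradient ascent in the policy-representation space once an environment representation has been inferred. A minimal sketch of that pattern follows; the bilinear score function, step count, and variable names are assumptions rather than details from the paper, and the adapted embedding would still need to be decoded into an executable policy.

```python
import torch

def adapt_policy_embedding(score_fn, z_env, z_policy_init, steps=50, lr=0.05):
    """Gradient ascent on the policy representation to maximize a learned
    score conditioned on the inferred environment representation."""
    z_policy = z_policy_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z_policy], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        (-score_fn(z_env, z_policy)).backward()   # minimize the negative score
        optimizer.step()
    return z_policy.detach()

# Toy usage: a random bilinear form stands in for the learned scorer.
W = torch.randn(8, 8)
score = lambda ze, zp: ze @ W @ zp
z_env = torch.randn(8)        # inferred from a few transitions in the new environment
z_policy = adapt_policy_embedding(score, z_env, torch.zeros(8))
```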
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
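The entry above combines a Gaussian Process performance prior with Bayesian-optimisation-style policy search. A rough, generic sketch of that combination is shown below using scikit-learn; the RBF kernel, the UCB acquisition rule, the two-dimensional policy parameterization, and the synthetic returns are all assumptions for illustration, not details from the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def ucb_select(gp, candidates, kappa=2.0):
    """Upper-confidence-bound acquisition over candidate policy parameters."""
    mean, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mean + kappa * std)]

# Synthetic prior data: returns of policies (parameterized by 2 numbers) observed
# in training environments; stands in for the prior learned in simulation.
rng = np.random.default_rng(0)
train_params = rng.uniform(-1, 1, size=(30, 2))
train_returns = -np.sum(train_params ** 2, axis=1) + 0.05 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3)
gp.fit(train_params, train_returns)

# Bayesian-optimisation-style search for a promising policy in a new environment.
candidates = rng.uniform(-1, 1, size=(200, 2))
print("policy parameters to try next:", ucb_select(gp, candidates))
```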
- Learning a subspace of policies for online adaptation in Reinforcement Learning [14.7945053644125]
In control systems, the robot on which a policy is learned might differ from the robot on which a policy will run.
There is a need to develop RL methods that generalize well to variations of the training conditions.
In this article, we consider the simplest yet hardest-to-tackle generalization setting, where the test environment is unknown at train time.
arXiv Detail & Related papers (2021-10-11T11:43:34Z)
- Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
- Learning to Continuously Optimize Wireless Resource In Episodically Dynamic Environment [55.91291559442884]
This work develops a methodology that enables data-driven methods to continuously learn and optimize in a dynamic environment.
We propose to build the notion of continual learning into the modeling process of learning wireless systems.
Our design is based on a novel min-max formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2020-11-16T08:24:34Z)
- Self-Supervised Policy Adaptation during Deployment [98.25486842109936]
Self-supervision allows the policy to continue training after deployment without using any rewards.
Empirical evaluations are performed on diverse simulation environments from the DeepMind Control Suite and ViZDoom.
Our method improves generalization in 31 out of 36 environments across various tasks and outperforms domain randomization on a majority of environments.
arXiv Detail & Related papers (2020-07-08T17:56:27Z)
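Reward-free adaptation of the kind summarized above is typically realized by continuing to update a shared observation encoder with a self-supervised auxiliary objective on deployment data. The sketch below uses an inverse-dynamics head as that objective; the architecture, tensor shapes, and loss are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # shared with the policy
inverse_model = nn.Linear(64, 4)                         # predicts the action taken
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(inverse_model.parameters()), lr=1e-4
)

def self_supervised_update(obs, next_obs, action):
    """One reward-free update during deployment: predict the action that
    connects consecutive observations (inverse dynamics)."""
    h = torch.cat([encoder(obs), encoder(next_obs)], dim=-1)
    loss = nn.functional.mse_loss(inverse_model(h), action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy deployment step with random observations and continuous actions.
obs, next_obs = torch.randn(8, 16), torch.randn(8, 16)
action = torch.randn(8, 4)
print("aux loss:", self_supervised_update(obs, next_obs, action))
```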
- Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization [100.72335252255989]
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task.
arXiv Detail & Related papers (2020-05-06T16:14:48Z)
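One plausible reading of the regularization described in the entry above is a divergence penalty that keeps the RNN policy's action distribution close to the task-specific informed policy during training. The direction of the KL term, the coefficient, and the categorical action distribution below are assumptions, not details from the paper.

```python
import torch
import torch.distributions as D

def regularized_policy_loss(rl_loss, rnn_logits, informed_logits, beta=0.1):
    """Adds a KL(informed || rnn) penalty to the usual RL objective so the
    recurrent policy stays close to a policy trained specifically on the task."""
    rnn_dist = D.Categorical(logits=rnn_logits)
    informed_dist = D.Categorical(logits=informed_logits)
    kl = D.kl_divergence(informed_dist, rnn_dist).mean()
    return rl_loss + beta * kl

# Toy usage with random logits standing in for both policies' outputs.
loss = regularized_policy_loss(torch.tensor(1.0), torch.randn(8, 5), torch.randn(8, 5))
```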