Provably Efficient Model-based Policy Adaptation
- URL: http://arxiv.org/abs/2006.08051v1
- Date: Sun, 14 Jun 2020 23:16:20 GMT
- Authors: Yuda Song, Aditi Mavalankar, Wen Sun, Sicun Gao
- Abstract summary: A promising approach is to quickly adapt pre-trained policies to new environments.
Existing methods for this policy adaptation problem typically rely on domain randomization and meta-learning.
We propose new model-based mechanisms that perform online adaptation in unseen target environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high sample complexity of reinforcement learning challenges its use in
practice. A promising approach is to quickly adapt pre-trained policies to new
environments. Existing methods for this policy adaptation problem typically
rely on domain randomization and meta-learning, by sampling from some
distribution of target environments during pre-training, and thus face
difficulty on out-of-distribution target environments. We propose new
model-based mechanisms that perform online adaptation in unseen target
environments by combining ideas from no-regret online learning and adaptive
control. We prove that the approach learns policies in the target environment
that can quickly recover trajectories from the source environment, and
establish the rate of convergence in general settings. We demonstrate the
benefits of our approach for policy adaptation in a diverse set of continuous
control tasks, achieving the performance of state-of-the-art methods with much
lower sample complexity.
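The core mechanism can be sketched in a toy linear-dynamics setting. This is an illustrative sketch only, not the paper's code: the dynamics matrices, the pre-trained feedback policy, and all helper names below are our own assumptions. An online-refit model of the target environment is used to pick actions whose predicted transitions recover the transitions the pre-trained policy would produce in the source environment.

```python
import numpy as np

# Toy linear-dynamics illustration; A_src, B_src, K, and all helper
# names are assumptions for this sketch, not the paper's code.
rng = np.random.default_rng(0)

A_src = np.array([[1.0, 0.1], [0.0, 1.0]])    # source dynamics: s' = A s + B a
B_src = np.array([[0.0], [0.1]])
A_tgt = np.array([[1.0, 0.12], [0.0, 0.95]])  # perturbed (unknown) target dynamics
B_tgt = np.array([[0.0], [0.08]])
K = np.array([[-1.0, -1.5]])                  # pre-trained source policy: a = K s

def source_step(s):
    """Transition the pre-trained policy would produce in the source env."""
    return A_src @ s + B_src @ (K @ s)

def adapt_action(s, A_hat, B_hat, ridge=1e-3):
    """Choose the action whose predicted next state (under the learned
    target model) matches the desired source-environment transition."""
    residual = source_step(s) - A_hat @ s
    return np.linalg.solve(B_hat.T @ B_hat + ridge * np.eye(1), B_hat.T @ residual)

# Online loop: act, observe target transitions, refit the target model.
s = np.array([[1.0], [0.0]])
A_hat, B_hat = A_src.copy(), B_src.copy()     # initialise from the source model
data_s, data_a, data_n = [], [], []
for t in range(60):
    # Small exploration noise keeps the regression well-conditioned.
    a = adapt_action(s, A_hat, B_hat) + 0.1 * rng.standard_normal((1, 1))
    s_next = A_tgt @ s + B_tgt @ a            # step the true target env
    data_s.append(s.ravel()); data_a.append(a.ravel()); data_n.append(s_next.ravel())
    if t >= 4:                                # refit dynamics by least squares
        X = np.hstack([np.array(data_s), np.array(data_a)])
        W, *_ = np.linalg.lstsq(X, np.array(data_n), rcond=None)
        A_hat, B_hat = W[:2].T, W[2:].T
    s = s_next
# The adapted actions recover the (stable) source closed-loop trajectory.
```

In this linear special case the least-squares refit identifies the target dynamics exactly, so the target trajectory tracks the source one up to the exploration noise; the paper's contribution is proving convergence rates for this kind of scheme in general settings.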
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts
We develop ERPO, an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs.
arXiv Detail & Related papers (2024-10-22T09:29:53Z)
- A Conservative Approach for Few-Shot Transfer in Off-Dynamics Reinforcement Learning
Off-dynamics Reinforcement Learning seeks to transfer a policy from a source environment to a target environment characterized by distinct yet similar dynamics.
We propose an innovative approach inspired by recent advancements in Imitation Learning and conservative RL algorithms.
arXiv Detail & Related papers (2023-12-24T13:09:08Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Acceleration in Policy Optimization
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations
We propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
In the offline training phase, environment and policy representations are learned through contrastive learning and policy recovery.
In the online adaptation phase, with the environment context inferred from a few experiences collected in the new environment, the policy is optimized by gradient ascent.
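The online phase described above can be sketched in a few lines. The encoder and value function here are hypothetical stand-ins for PAnDR's learned networks (which are trained offline); only the gradient-ascent mechanics over a policy embedding are illustrated.

```python
import numpy as np

# Stand-ins for PAnDR's learned models (the real ones are neural networks
# trained offline via contrastive learning and policy recovery).
def encode_env(transitions):
    """Infer an environment context from a few (s, a, s') transitions."""
    return transitions.mean(axis=0)

def value(policy_emb, env_emb):
    # Hypothetical value model: peaks when the policy embedding
    # matches the environment context.
    return -np.sum((policy_emb - env_emb) ** 2)

def value_grad(policy_emb, env_emb):
    return -2.0 * (policy_emb - env_emb)

rng = np.random.default_rng(0)
transitions = rng.normal(size=(10, 4))   # few experiences from the new env
env_emb = encode_env(transitions)        # inferred environment context

policy_emb = np.zeros(4)                 # policy representation to adapt
for _ in range(200):                     # online phase: gradient ascent
    policy_emb += 0.05 * value_grad(policy_emb, env_emb)
```

With the models frozen, adaptation reduces to a cheap optimization in embedding space, which is what makes this style of approach fast at deployment time.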
arXiv Detail & Related papers (2022-04-06T14:47:35Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Fast Model-based Policy Search for Universal Policy Networks
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
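The described pipeline can be sketched with a toy Gaussian Process and a UCB acquisition. Everything here is our own illustrative assumption (1-D policy parameter, RBF kernel, the synthetic performance function); it is not the paper's implementation.

```python
import numpy as np

# Toy GP + Bayesian optimisation sketch; all names and values are assumed.
def rbf(x1, x2, ls=0.3):
    """RBF kernel between two 1-D point sets."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """GP posterior mean/std at query points Xs given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(np.diag(rbf(Xs, Xs) - Ks.T @ sol), 1e-12, None)
    return mu, np.sqrt(var)

def performance(theta):                  # unknown target-env performance
    return -(theta - 0.62) ** 2          # best policy parameter at 0.62

candidates = np.linspace(0.0, 1.0, 101)  # policies in the universal network
X = np.array([0.1, 0.9])                 # prior evaluations (e.g. from sim)
y = performance(X)

for _ in range(10):                      # Bayesian optimisation loop
    mu, sd = gp_posterior(X, y, candidates)
    theta = candidates[np.argmax(mu + 2.0 * sd)]   # UCB acquisition
    X = np.append(X, theta)
    y = np.append(y, performance(theta))

best = X[np.argmax(y)]                   # best policy parameter found
```

The GP prior concentrates the search, so only a handful of target-environment evaluations are needed to locate a well-performing policy.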
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
- Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments.
We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task.
arXiv Detail & Related papers (2020-05-06T16:14:48Z)
- Unsupervised Domain Adaptation in Person re-ID via k-Reciprocal Clustering and Large-Scale Heterogeneous Environment Synthesis
We introduce an unsupervised domain adaptation approach for person re-identification.
Experimental results show that the proposed ktCUDA and SHRED approach achieves an average improvement of +5.7 mAP in re-identification performance.
arXiv Detail & Related papers (2020-01-14T17:43:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.