Offline Meta-Reinforcement Learning with Online Self-Supervision
- URL: http://arxiv.org/abs/2107.03974v1
- Date: Thu, 8 Jul 2021 17:01:32 GMT
- Title: Offline Meta-Reinforcement Learning with Online Self-Supervision
- Authors: Vitchyr H. Pong, Ashvin Nair, Laura Smith, Catherine Huang, Sergey Levine
- Abstract summary: We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy.
Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data.
We find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
- Score: 66.42016534065276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Meta-reinforcement learning (RL) can be used to train policies that quickly adapt to new tasks with orders of magnitude less data than standard RL, but this fast adaptation often comes at the cost of greatly increasing the amount of reward supervision during meta-training time. Offline meta-RL removes the need to continuously provide reward supervision because rewards must only be provided once when the offline dataset is generated. In addition to the challenges of offline RL, a unique distribution shift is present in meta-RL: agents learn exploration strategies that can gather the experience needed to learn a new task, and also learn adaptation strategies that work well when presented with the trajectories in the dataset, but the adaptation strategies are not adapted to the data distribution that the learned exploration strategies collect. Unlike the online setting, the adaptation and exploration strategies cannot effectively adapt to each other, resulting in poor performance. In this paper, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground truth reward labels, to bridge this distribution shift problem. Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data. By removing the need to provide reward labels for the online experience, our approach can be more practical to use in settings where reward supervision would otherwise be provided manually. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
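The abstract's core mechanism, learning a distribution over reward functions from the reward-labeled offline data and then sampling it to label the unsupervised online data, can be sketched as follows. This is a minimal illustration under assumed details (a latent task variable z, a learned reward decoder, synthetic stand-in tensors), not the authors' implementation.
```python
# Minimal sketch of the self-supervised reward-labeling idea described above.
# Assumptions (not taken from the paper's code): a latent task variable z, a
# reward decoder r_hat(s, a, z) fit on the reward-labeled offline data, and
# pseudo-rewards produced by sampling z when labeling unsupervised online data.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 8, 2, 4

class RewardDecoder(nn.Module):
    """Predicts reward from (state, action, latent task variable)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + LATENT_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1)).squeeze(-1)

decoder = RewardDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# Phase 1: fit the reward model on reward-labeled offline data (synthetic
# stand-ins here; in practice z would come from a task encoder trained jointly).
s, a = torch.randn(256, STATE_DIM), torch.randn(256, ACTION_DIM)
z, r = torch.randn(256, LATENT_DIM), torch.randn(256)
for _ in range(100):
    loss = ((decoder(s, a, z) - r) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: self-supervise reward labels for unsupervised online data by
# sampling a task variable and querying the learned reward model.
online_s, online_a = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM)
z_sampled = torch.randn(32, LATENT_DIM)  # sample a task from the prior
with torch.no_grad():
    pseudo_r = decoder(online_s, online_a, z_sampled)
# (online_s, online_a, pseudo_r) can now join the meta-training replay buffer.
```
The point is only the two-phase flow: supervised reward fitting on the offline data, then pseudo-labeling of reward-free online experience.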
Related papers
- Real-World Offline Reinforcement Learning from Vision Language Model Feedback [19.494335952082466]
Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions.
Most existing offline RL works assume the dataset is already labeled with the task rewards.
We propose a novel system that automatically generates reward labels for offline datasets.
arXiv Detail & Related papers (2024-11-08T02:12:34Z)
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method, SUPE (Skills from Unlabeled Prior data for Exploration), demonstrates that carefully combining skills extracted from unlabeled prior data with online exploration compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
- Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z)
- Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z)
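As a rough illustration of the offline preference-based reward learning setting in the entry above, the sketch below selects query pairs from a pool of offline segments and fits a reward model with a Bradley-Terry objective; the acquisition rule, the random preference labels, and the tensor shapes are placeholder assumptions rather than the benchmarked algorithms.
```python
# Generic sketch of offline preference-based reward learning: pick a pair of
# trajectory segments from the offline pool, obtain a preference label, and
# fit a reward model with the Bradley-Terry likelihood. The acquisition rule,
# random labels, and shapes below are illustrative assumptions.
import torch
import torch.nn as nn

OBS_DIM, SEG_LEN, POOL = 6, 10, 500

reward_net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

segments = torch.randn(POOL, SEG_LEN, OBS_DIM)  # stand-in offline segments

def segment_return(seg_batch):
    """Sum of predicted per-step rewards over each segment."""
    return reward_net(seg_batch).sum(dim=(-2, -1))

for step in range(50):
    # Pool-based query selection: query the pair whose predicted returns are
    # closest (a crude stand-in for an uncertainty-based acquisition rule).
    with torch.no_grad():
        order = torch.argsort(segment_return(segments))
    i, j = int(order[POOL // 2]), int(order[POOL // 2 + 1])

    # Preference label; in practice this comes from a human or scripted labeler.
    label = (torch.rand(()) > 0.5).float()

    # Bradley-Terry loss: P(segment i preferred over j) = sigmoid(R_i - R_j).
    logits = segment_return(segments[i:i + 1]) - segment_return(segments[j:j + 1])
    loss = nn.functional.binary_cross_entropy_with_logits(logits.squeeze(), label)
    opt.zero_grad()
    loss.backward()
    opt.step()
```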
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
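The entry above notes that a return-based resampling change of under ten lines is enough to boost offline RL. The sketch below shows one way such rebalancing can look; the rank-based weighting and the synthetic episode layout are illustrative assumptions, not necessarily ReD's exact scheme.
```python
# Sketch of return-based data rebalancing for an offline buffer: sample
# transitions with probability weighted by the return of the episode they came
# from, so high-return trajectories are seen more often while the dataset's
# support is unchanged. The exact weighting here is an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in offline dataset: each transition records its episode's return.
episode_returns = rng.normal(size=100)           # one return per episode
episode_ids = rng.integers(0, 100, size=10_000)  # episode id per transition

# Rank-normalize returns and use them as sampling weights.
ranks = episode_returns.argsort().argsort()
weights = (ranks + 1) / ranks.max()              # higher return -> higher weight
probs = weights[episode_ids]
probs = probs / probs.sum()

def sample_batch(batch_size=256):
    """Return-weighted minibatch indices for the offline RL update."""
    return rng.choice(len(episode_ids), size=batch_size, p=probs)

batch_idx = sample_batch()
```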
- Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning [147.61075994259807]
We propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL.
ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL.
We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks.
arXiv Detail & Related papers (2022-01-31T18:39:27Z)
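The ExORL entry above describes a data-centric pipeline: collect reward-free exploratory data, relabel it with a downstream reward, and train an off-policy learner on the result. The sketch below shows that flow with a toy reward function and synthetic arrays, which are assumptions for illustration only.
```python
# Sketch of an ExORL-style pipeline: (1) reward-free exploratory data,
# (2) relabel it with a downstream reward function, (3) hand the result to any
# off-policy / offline RL learner. The toy reward and dataset layout are
# assumptions, not the paper's benchmark tasks.
import numpy as np

rng = np.random.default_rng(0)

# (1) Exploratory, reward-free transitions (e.g., from an intrinsic-motivation
# collector); only states and actions are stored.
states = rng.normal(size=(5_000, 4))
actions = rng.normal(size=(5_000, 2))
next_states = states + 0.1 * rng.normal(size=states.shape)

# (2) Relabel with the downstream task reward, here a toy "reach the origin"
# reward; any reward defined on (s, a, s') can be plugged in.
def downstream_reward(s, a, s_next):
    return -np.linalg.norm(s_next, axis=-1)

rewards = downstream_reward(states, actions, next_states)

# (3) The relabeled tuples form an ordinary offline RL dataset.
dataset = {"obs": states, "act": actions, "rew": rewards, "next_obs": next_states}
```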
- Offline Meta-Reinforcement Learning with Advantage Weighting [125.21298190780259]
This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting.
Offline meta-RL is analogous to the widely successful supervised learning strategy of pre-training a model on a large batch of fixed, pre-collected data.
We propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training.
arXiv Detail & Related papers (2020-08-13T17:57:14Z)
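The MACAW entry above notes that both the inner and outer loops of meta-training reduce to simple supervised regression. The sketch below shows the kind of objectives that advantage-weighted regression typically uses (value regression plus exponentially advantage-weighted policy regression); the MAML-style inner/outer loop itself is omitted, and the network sizes, temperature, and synthetic batch are assumptions.
```python
# Sketch of the supervised regression objectives described in the MACAW entry:
# value regression plus advantage-weighted policy regression. The meta-learning
# wrapper (inner/outer loop) is omitted; sizes and data are illustrative.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, TEMP = 8, 2, 1.0

value_fn = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
policy_mean = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))

obs = torch.randn(128, OBS_DIM)
act = torch.randn(128, ACT_DIM)
ret = torch.randn(128)  # Monte-Carlo returns from the offline data

# Supervised value regression.
value_loss = ((value_fn(obs).squeeze(-1) - ret) ** 2).mean()

# Advantage-weighted policy regression: imitate dataset actions, weighted by
# exp(advantage / temperature) so high-advantage actions dominate.
with torch.no_grad():
    adv = ret - value_fn(obs).squeeze(-1)
    weights = torch.exp(adv / TEMP).clamp(max=20.0)
log_prob = -((policy_mean(obs) - act) ** 2).sum(dim=-1)  # fixed-variance Gaussian, up to a constant
policy_loss = -(weights * log_prob).mean()

loss = value_loss + policy_loss
loss.backward()  # in MACAW this gradient would feed the inner/outer meta-updates
```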
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers or information and is not responsible for any consequences.