Curriculum Offline Imitation Learning
- URL: http://arxiv.org/abs/2111.02056v1
- Date: Wed, 3 Nov 2021 08:02:48 GMT
- Title: Curriculum Offline Imitation Learning
- Authors: Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li
Zhao, Tie-Yan Liu
- Abstract summary: Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning a mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
- Score: 72.1015201041391
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Offline reinforcement learning (RL) tasks require the agent to learn from a
pre-collected dataset with no further interactions with the environment.
Despite the potential to surpass the behavioral policies, RL-based methods are
generally impractical due to training instability and the bootstrapping of
extrapolation errors, which always require careful hyperparameter tuning via
online evaluation. In contrast, offline imitation learning (IL) has no such
issues, since it learns the policy directly without estimating the value
function by bootstrapping. However, IL is usually limited by the capability of
the behavioral policy and tends to learn a mediocre behavior from a dataset
collected by a mixture of policies. In this paper, we aim to take advantage
of IL while mitigating this drawback. Observing that behavior cloning is able to
imitate neighboring policies with less data, we propose Curriculum
Offline Imitation Learning (COIL), which utilizes an experience picking
strategy to imitate from adaptive neighboring policies with a higher return,
and improves the current policy along curriculum stages. On continuous control
benchmarks, we compare COIL against both imitation-based and RL-based methods,
showing that it not only avoids merely learning a mediocre behavior on mixed
datasets but is even competitive with state-of-the-art offline RL methods.
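To make the curriculum idea concrete, here is a minimal Python sketch of the general mechanism described above. It is an illustration under stated assumptions rather than the authors' implementation: the neighborhood measure (the policy's log-likelihood of a trajectory's actions) and the `policy.log_prob` / `bc_update` interfaces are hypothetical stand-ins.

```python
import numpy as np

def trajectory_return(traj):
    """Undiscounted return of a trajectory, given as a list of (s, a, r) tuples."""
    return sum(r for _, _, r in traj)

def avg_log_likelihood(policy, traj):
    """How likely the current policy is to reproduce the trajectory's actions."""
    return float(np.mean([policy.log_prob(s, a) for s, a, _ in traj]))

def coil_style_curriculum(policy, dataset, bc_update, n_stages=10, pick_size=50):
    """Illustrative curriculum loop: repeatedly imitate nearby, higher-return behavior.

    `policy.log_prob(s, a)` and `bc_update(policy, trajectories)` are assumed
    interfaces; `dataset` is a list of trajectories.
    """
    current_return = -np.inf
    remaining = list(range(len(dataset)))  # indices of unused trajectories
    for _ in range(n_stages):
        # Candidate experiences: trajectories that outperform the current policy.
        candidates = [i for i in remaining
                      if trajectory_return(dataset[i]) > current_return]
        if not candidates:
            break
        # Among them, pick the ones "closest" to the current policy, measured
        # here by the policy's likelihood of their actions (a proxy for coming
        # from a neighboring policy).
        candidates.sort(key=lambda i: avg_log_likelihood(policy, dataset[i]),
                        reverse=True)
        picked = candidates[:pick_size]
        # Behavior-clone the picked experiences, then raise the return bar.
        bc_update(policy, [dataset[i] for i in picked])
        current_return = float(np.mean([trajectory_return(dataset[i])
                                        for i in picked]))
        remaining = [i for i in remaining if i not in picked]
    return policy
```

Each stage imitates behavior that is both close to the current policy and better than it, which is the sense in which the method improves the current policy along curriculum stages.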
Related papers
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate
Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
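A rough sketch of that decoupling is given below; the training loop and the `collect_episode` / `optimistic_update` / `pessimistic_update` helpers are hypothetical, intended only to illustrate how exploration and the deployed policy are separated.

```python
def decoupled_online_finetuning(env, explore_policy, exploit_policy, buffer,
                                n_rounds, collect_episode,
                                optimistic_update, pessimistic_update):
    """Illustrative decoupled loop: one policy explores online, another is
    extracted from all observed data and used for evaluation/deployment."""
    for _ in range(n_rounds):
        # The optimistic policy gathers fresh experience online.
        buffer.extend(collect_episode(env, explore_policy))
        optimistic_update(explore_policy, buffer)

        # The pessimistic policy is trained offline on everything seen so far;
        # it never drives exploration, which is meant to keep exploration bias
        # out of the deployed policy.
        pessimistic_update(exploit_policy, buffer)
    return exploit_policy
```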
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
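The summary does not specify the adaptation rule, so the snippet below is only a hedged illustration of the general idea: one actor loss that interpolates between behavior cloning and Q-maximization, with the coefficient adapted from the critic's own improvement estimate (the heuristic and the `policy`/`q_net` interfaces are assumptions, not the paper's objective).

```python
import torch

def adaptive_behavior_regularized_loss(policy, q_net, batch, target_gap=1.0):
    """Illustrative adaptive behavior-regularized actor loss (assumed form).

    batch: dict with 'obs' and 'actions' tensors from the offline dataset.
    The coefficient lam shifts the objective between cloning the dataset
    actions (large lam) and maximizing the learned Q-value (small lam).
    """
    obs, data_actions = batch["obs"], batch["actions"]
    policy_actions = policy(obs)

    # How much the critic thinks the policy improves over the dataset actions.
    improvement = (q_net(obs, policy_actions) - q_net(obs, data_actions)).mean()

    # Heuristic adaptation: trust the critic (little cloning) when the estimated
    # improvement is modest; lean on cloning when the improvement looks
    # implausibly large, which often signals extrapolation error.
    lam = torch.clamp(improvement.detach() / target_gap, min=0.0, max=10.0)

    bc_loss = ((policy_actions - data_actions) ** 2).mean()
    rl_loss = -q_net(obs, policy_actions).mean()
    return rl_loss + lam * bc_loss
```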
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with fewer than 10 lines of code change and adds negligible running time.
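A hedged sketch of what return-based rebalancing can look like in practice is shown below; the softmax weighting and temperature are assumptions for illustration, not ReD's exact rule.

```python
import numpy as np

def return_weighted_indices(trajectory_returns, n_samples, temperature=1.0,
                            rng=None):
    """Resample trajectory indices with probability increasing in return.

    Because every trajectory keeps a non-zero probability, the support of the
    dataset distribution is unchanged; only the emphasis shifts toward
    higher-return behavior.
    """
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    # Softmax over shifted returns; temperature controls how aggressive the
    # rebalancing is.
    z = (returns - returns.max()) / max(temperature, 1e-8)
    probs = np.exp(z)
    probs /= probs.sum()
    return rng.choice(len(returns), size=n_samples, replace=True, p=probs)
```

Batches are then drawn using these indices instead of uniform sampling, which is consistent with the claim that the change to an existing offline RL pipeline amounts to only a few lines.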
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- A Policy-Guided Imitation Approach for Offline Reinforcement Learning [9.195775740684248]
We introduce Policy-guided Offline RL (POR).
POR demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL.
arXiv Detail & Related papers (2022-10-15T15:54:28Z)
- Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to learning purely from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
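Implicit Q-learning achieves this by fitting the value function with expectile regression on dataset actions only; the snippet below is a minimal sketch of such losses (the network interfaces, batch keys, and the absence of target networks are simplifying assumptions), not the paper's code.

```python
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: weight tau on positive residuals, 1 - tau otherwise."""
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff ** 2).mean()

def iql_style_losses(q_net, v_net, batch, gamma=0.99, tau=0.7):
    """Illustrative IQL-style critic losses computed only on dataset actions."""
    s, a, r, s_next, done = (batch[k] for k in ("obs", "actions", "rewards",
                                                "next_obs", "dones"))
    # V(s) approaches an upper expectile of Q(s, a) over dataset actions,
    # standing in for a max over actions without ever evaluating unseen ones.
    v_loss = expectile_loss(q_net(s, a).detach() - v_net(s), tau)

    # Q is regressed toward r + gamma * V(s'), again using only logged actions.
    target = r + gamma * (1.0 - done) * v_net(s_next).detach()
    q_loss = ((q_net(s, a) - target) ** 2).mean()
    return q_loss, v_loss
```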
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
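As context for the "behavior regularized" family, the sketch below shows a generic BRAC-style actor objective, a Q term plus a divergence penalty toward an estimated behavior policy; it does not capture BRAC+'s specific improvements, and the distribution-returning `policy`/`behavior_policy` interfaces are assumptions.

```python
import torch

def behavior_regularized_actor_loss(policy, q_net, behavior_policy, obs, alpha=1.0):
    """Generic behavior-regularized actor objective (assumed, simplified form).

    Maximizes the critic's value of the policy's actions while penalizing the
    (sampled) KL divergence from an estimated behavior policy. BRAC-style
    methods differ mainly in how this divergence term is estimated and weighted.
    """
    dist = policy(obs)        # assumed to return a torch.distributions.Distribution
    actions = dist.rsample()  # reparameterized sample, shape [batch, act_dim]

    q_term = q_net(obs, actions).mean()

    # Single-sample KL(pi || behavior) estimate: log pi(a|s) - log beta(a|s).
    kl_term = (dist.log_prob(actions) -
               behavior_policy(obs).log_prob(actions)).mean()

    return -q_term + alpha * kl_term
```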
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value function perspective.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)