Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching
- URL: http://arxiv.org/abs/2303.02569v1
- Date: Sun, 5 Mar 2023 03:35:11 GMT
- Title: Offline Imitation Learning with Suboptimal Demonstrations via Relaxed
Distribution Matching
- Authors: Lantao Yu, Tianhe Yu, Jiaming Song, Willie Neiswanger, Stefano Ermon
- Abstract summary: Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline IL method in six standard continuous control environments.
- Score: 109.5084863685397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline imitation learning (IL) promises the ability to learn performant
policies from pre-collected demonstrations without interactions with the
environment. However, imitating behaviors fully offline typically requires a
large amount of expert data. To tackle this issue, we study the setting where we have
limited expert data and supplementary suboptimal data. In this case, a
well-known issue is the distribution shift between the learned policy and the
behavior policy that collects the offline data. Prior works mitigate this issue
by regularizing the KL divergence between the stationary state-action
distributions of the learned policy and the behavior policy. We argue that such
constraints based on exact distribution matching can be overly conservative and
hamper policy learning, especially when the imperfect offline data is highly
suboptimal. To resolve this issue, we present RelaxDICE, which employs an
asymmetrically-relaxed f-divergence for explicit support regularization.
Specifically, instead of driving the learned policy to exactly match the
behavior policy, we impose little penalty whenever the density ratio between
their stationary state-action distributions is upper bounded by a constant.
Note that such a formulation leads to a nested min-max optimization problem,
which causes instability in practice. RelaxDICE addresses this challenge by
supporting a closed-form solution for the inner maximization problem. Extensive
empirical study shows that our method significantly outperforms the best prior
offline IL method in six standard continuous control environments with over 30%
performance gain on average, across 22 settings where the imperfect dataset is
highly suboptimal.
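To make the relaxation concrete, here is a minimal, hypothetical PyTorch sketch, not the paper's implementation: it contrasts an exact-matching KL-style penalty with an asymmetrically-relaxed penalty that is zero whenever the density ratio d_pi(s,a)/d_B(s,a) stays below a constant C. The squared-hinge form, the value of C, and the function names are illustrative assumptions; the paper's actual relaxed f-divergence and the closed-form solution of the inner maximization are only given in the paper itself.

```python
import torch

def exact_matching_penalty(log_ratio: torch.Tensor) -> torch.Tensor:
    """KL-style penalty: log_ratio estimates log(d_pi(s,a) / d_B(s,a)), and
    pushing it toward zero everywhere enforces exact matching of the two
    stationary state-action distributions."""
    ratio = log_ratio.exp()
    return (ratio * log_ratio).mean()  # generator f(u) = u * log(u) of the KL divergence

def relaxed_support_penalty(log_ratio: torch.Tensor, c: float = 2.0) -> torch.Tensor:
    """Asymmetrically-relaxed penalty (illustrative): zero while the density
    ratio stays below c, so the learned policy only has to remain within the
    support of the offline data instead of matching its distribution exactly."""
    ratio = log_ratio.exp()
    return torch.relu(ratio - c).pow(2).mean()

# Toy usage: in a DICE-style method the log-ratios would come from a learned estimator.
log_ratio = 0.5 * torch.randn(256)
print(exact_matching_penalty(log_ratio), relaxed_support_penalty(log_ratio))
```

Because the relaxed term vanishes on the support of the offline data, a highly suboptimal supplementary dataset no longer pulls the learned policy toward exact imitation of the behavior policy, which is the failure mode the abstract attributes to KL-based regularization.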
Related papers
- Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning [12.112619241073158]
In offline reinforcement learning, the challenge of out-of-distribution actions is pronounced.
Existing methods often constrain the learned policy through policy regularization.
We propose Adaptive Advantage-guided Policy Regularization (A2PR)
arXiv Detail & Related papers (2024-05-30T10:20:55Z)
- Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL)
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z)
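The GORL entry above hinges on a guiding network that assigns per-sample weights to the policy-improvement and policy-constraint terms. The following is a minimal, hypothetical sketch of that idea only; the network architecture, the Softplus-gated weights, and the loss combination are assumptions rather than GORL's actual objective.

```python
import torch
import torch.nn as nn

class GuidingNetwork(nn.Module):
    """Maps a (state, action) pair to two non-negative weights that trade off
    policy improvement against the policy constraint for that sample."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Softplus(),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))  # shape: (batch, 2)

def guided_loss(weights: torch.Tensor,
                improvement_loss: torch.Tensor,
                constraint_loss: torch.Tensor) -> torch.Tensor:
    """Per-sample weighted combination of two per-sample loss vectors of shape (batch,)."""
    return (weights[:, 0] * improvement_loss + weights[:, 1] * constraint_loss).mean()
```

In GORL the weights would additionally be trained with the help of the few available expert demonstrations; the sketch only shows where such per-sample weights enter the objective.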
- Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design [18.326126953667842]
We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
arXiv Detail & Related papers (2023-01-31T16:12:31Z)
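The policy-evaluation entry above rests on a standard importance-sampling fact: the variance of a Monte Carlo estimator depends on which behavior policy collects the data, and a well-chosen behavior policy can beat on-policy sampling. Below is a toy one-step (bandit) demonstration of that principle; it is not the paper's closed-form construction, and the reward table and policies are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: three actions with known rewards; evaluate target policy pi.
rewards = np.array([1.0, 5.0, 10.0])
pi = np.array([0.6, 0.3, 0.1])
true_value = float(pi @ rewards)

def is_estimates(behavior: np.ndarray, n: int = 10_000) -> np.ndarray:
    """Per-sample importance-sampling estimates of E_pi[r] from data collected by `behavior`."""
    actions = rng.choice(len(pi), size=n, p=behavior)
    return (pi[actions] / behavior[actions]) * rewards[actions]

on_policy = is_estimates(pi)                      # ordinary on-policy Monte Carlo
tailored = pi * rewards / np.sum(pi * rewards)    # behavior policy proportional to pi * r
off_policy = is_estimates(tailored)

print(f"true value          : {true_value:.3f}")
print(f"on-policy  mean/var : {on_policy.mean():.3f} / {on_policy.var():.3f}")
print(f"off-policy mean/var : {off_policy.mean():.3f} / {off_policy.var():.3f}")
```

With the tailored behavior policy every weighted sample equals the true value exactly, so the estimator variance drops to zero; the cited paper proposes a closed-form behavior policy with the same variance-reduction goal and learns it from previously collected offline data.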
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning a mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)