Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory
Weighting
- URL: http://arxiv.org/abs/2306.13085v1
- Date: Thu, 22 Jun 2023 17:58:02 GMT
- Title: Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory
Weighting
- Authors: Zhang-Wei Hong, Pulkit Agrawal, Rémi Tachet des Combes, Romain Laroche
- Abstract summary: We show that state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit high-performing trajectories to the fullest.
This reweighted sampling strategy may be combined with any offline RL algorithm.
We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with the reweighted sampling strategy fully exploit the dataset.
- Score: 29.21380944341589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most offline reinforcement learning (RL) algorithms return a target policy
maximizing a trade-off between (1) the expected performance gain over the
behavior policy that collected the dataset, and (2) the risk stemming from the
out-of-distribution-ness of the induced state-action occupancy. It follows that
the performance of the target policy is strongly related to the performance of
the behavior policy and, thus, the trajectory return distribution of the
dataset. We show that in mixed datasets consisting of mostly low-return
trajectories and minor high-return trajectories, state-of-the-art offline RL
algorithms are overly restrained by low-return trajectories and fail to exploit
high-performing trajectories to the fullest. To overcome this issue, we show
that, in deterministic MDPs with stochastic initial states, the dataset
sampling can be re-weighted to induce an artificial dataset whose behavior
policy has a higher return. This re-weighted sampling strategy may be combined
with any offline RL algorithm. We further analyze that the opportunity for
performance improvement over the behavior policy correlates with the
positive-sided variance of the returns of the trajectories in the dataset. We
empirically show that while CQL, IQL, and TD3+BC achieve only a part of this
potential policy improvement, these same algorithms combined with our
reweighted sampling strategy fully exploit the dataset. Furthermore, we
empirically demonstrate that, despite its theoretical limitation, the approach
may still be efficient in stochastic environments. The code is available at
https://github.com/Improbable-AI/harness-offline-rl.
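As a rough illustration of the reweighting idea, the sketch below resamples trajectories with probability increasing in their return; the softmax-over-normalized-returns form, the temperature, and the assumed trajectory format (dicts with a "rewards" list) are choices made for illustration, not necessarily the paper's exact weighting scheme.
```python
import numpy as np

def make_return_weighted_sampler(trajectories, temperature=1.0, seed=0):
    """Sample trajectories with probability increasing in their return.

    Illustrative sketch: the softmax weighting over normalized returns and the
    temperature are assumptions, not necessarily the paper's exact scheme.
    Each trajectory is assumed to be a dict with a "rewards" list.
    """
    rng = np.random.default_rng(seed)
    returns = np.array([sum(t["rewards"]) for t in trajectories], dtype=np.float64)
    z = (returns - returns.mean()) / (returns.std() + 1e-8)  # scale-free temperature
    probs = np.exp(z / temperature)
    probs /= probs.sum()

    def sample(batch_size):
        # Resampling by return induces an "artificial dataset" whose behavior
        # policy has a higher return than the original low-return mixture.
        idx = rng.choice(len(trajectories), size=batch_size, p=probs)
        return [trajectories[i] for i in idx]

    return sample
```
Any offline RL algorithm (e.g. CQL, IQL, TD3+BC) can then draw its minibatches through such a sampler instead of uniformly from the dataset.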
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use gradient fields of the dataset density to adjust the original actions generated by a pre-trained offline RL algorithm.
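A minimal sketch of this kind of score-based action adjustment is given below, assuming a hypothetical score_model(state, action) that estimates grad_a log p(a | s); the step size, step count, and action clamp are illustrative choices, not CDSA's published procedure.
```python
import torch

@torch.no_grad()
def densify_action(score_model, state, action, step_size=0.1, n_steps=5):
    """Nudge a policy's action toward higher dataset density via a learned score.

    Hypothetical interface: score_model(state, action) is assumed to return an
    estimate of grad_a log p(a | s) learned from the offline dataset.
    """
    a = action.clone()
    for _ in range(n_steps):
        a = a + step_size * score_model(state, a)  # follow the density gradient
        a = a.clamp(-1.0, 1.0)                     # keep actions in a bounded space
    return a
```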
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories, without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data [28.445166861907495]
We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator.
We derive high-probability, instance-dependent bounds on its estimation error.
We also recover minimax-optimal offline learning in the adaptive setting.
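For intuition only, here is a generic tabular, model-based OPE sketch: it fits an empirical MDP from logged transitions and rolls the target policy through it for a fixed horizon. The function signature, the zero-fill for unvisited state-action pairs, and the finite-horizon setup are assumptions; it does not reproduce the TMIS estimator or the adaptive-data analysis.
```python
import numpy as np

def tabular_model_based_ope(transitions, pi_e, n_states, n_actions, horizon, init_dist):
    """Estimate the value of target policy pi_e (shape [S, A]) from logged data.

    Sketch under assumptions: empirical transition/reward model; unvisited
    (s, a) pairs contribute zero reward and terminate their probability mass.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    P_hat = np.divide(counts, visits[:, :, None],
                      out=np.zeros_like(counts), where=visits[:, :, None] > 0)
    r_hat = np.divide(reward_sum, visits,
                      out=np.zeros_like(reward_sum), where=visits > 0)

    value, d = 0.0, init_dist.copy()              # d: marginal state distribution
    for _ in range(horizon):
        d_sa = d[:, None] * pi_e                  # state-action occupancy under pi_e
        value += np.sum(d_sa * r_hat)             # accumulate expected reward
        d = np.einsum("sa,sat->t", d_sa, P_hat)   # propagate one step through the model
    return value
```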
arXiv Detail & Related papers (2023-06-24T21:48:28Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static, previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
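As a loose illustration of trading off improvement against cloning, the sketch below mixes a Q-maximization term with a squared-error behavior-cloning term; the specific loss form and the externally supplied bc_weight (which ABR would adapt) are assumptions, not ABR's actual objective.
```python
import torch

def adaptive_bc_actor_loss(actor, critic, states, actions, bc_weight):
    """Actor loss balancing improvement over the data against cloning it.

    Sketch under assumptions: critic(states, actions) returns a (batch,) tensor
    of Q-values; bc_weight is the adaptive trade-off coefficient.
    """
    pi_actions = actor(states)
    improve = -critic(states, pi_actions).mean()      # push toward higher Q-values
    clone = ((pi_actions - actions) ** 2).mean()      # stay close to dataset actions
    return improve + bc_weight * clone
```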
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between the learned policy and the dataset.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
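A minimal sketch of return-based rebalancing, assuming trajectories are duplicated in proportion to their normalized return; the linear scaling to at most max_copies copies is an illustrative recipe, not necessarily ReD's exact one. Support is unchanged because every original trajectory keeps at least one copy.
```python
import numpy as np

def rebalance_by_return(trajectories, max_copies=10):
    """Duplicate each trajectory in proportion to its normalized return.

    Sketch under assumptions: each trajectory is a dict with a "rewards" list;
    copies scale linearly from 1 (lowest return) to max_copies (highest).
    """
    returns = np.array([sum(t["rewards"]) for t in trajectories], dtype=np.float64)
    norm = (returns - returns.min()) / (returns.max() - returns.min() + 1e-8)
    copies = 1 + np.round(norm * (max_copies - 1)).astype(int)
    rebalanced = []
    for traj, k in zip(trajectories, copies):
        rebalanced.extend([traj] * int(k))
    return rebalanced
```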
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves average performance over the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning [4.819336169151637]
Offline Reinforcement Learning can learn policies from a given dataset without interacting with the environment.
We show how dataset characteristics influence the performance of Offline RL algorithms for discrete action environments.
For datasets with high trajectory quality (TQ), Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms.
arXiv Detail & Related papers (2021-11-08T18:48:43Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
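For reference, a sketch of an advantage-weighted regression loss in the spirit of CRR's exponential variant; the actor/critic interfaces, the Monte Carlo advantage estimate over n_samples actor samples, and the weight clip value are assumptions.
```python
import torch

def crr_style_actor_loss(actor, critic, states, actions, beta=1.0, n_samples=4):
    """Regress the actor onto dataset actions, weighted by estimated advantage.

    Sketch under assumptions: actor(states) samples actions,
    actor.log_prob(states, actions) returns per-sample log-likelihoods, and
    critic(states, actions) returns a (batch,) tensor of Q-values.
    """
    with torch.no_grad():
        q_data = critic(states, actions)
        q_pi = torch.stack(
            [critic(states, actor(states)) for _ in range(n_samples)]
        ).mean(0)                                         # Monte Carlo value estimate
        weights = torch.exp((q_data - q_pi) / beta).clamp(max=20.0)
    log_prob = actor.log_prob(states, actions)
    return -(weights * log_prob).mean()
```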
arXiv Detail & Related papers (2020-06-26T17:50:26Z)