Bridging Offline Reinforcement Learning and Imitation Learning: A Tale
of Pessimism
- URL: http://arxiv.org/abs/2103.12021v2
- Date: Mon, 3 Jul 2023 04:47:42 GMT
- Title: Bridging Offline Reinforcement Learning and Imitation Learning: A Tale
of Pessimism
- Authors: Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, Stuart Russell
- Abstract summary: Offline reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection.
Based on the composition of the offline dataset, two main categories of methods are used: imitation learning and vanilla offline RL.
We present a new offline RL framework that smoothly interpolates between the two extremes of data composition.
- Score: 26.11003309805633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline (or batch) reinforcement learning (RL) algorithms seek to learn an
optimal policy from a fixed dataset without active data collection. Based on
the composition of the offline dataset, two main categories of methods are
used: imitation learning, which is suitable for expert datasets, and vanilla
offline RL, which often requires uniform-coverage datasets. From a practical
standpoint, datasets often deviate from these two extremes and the exact data
composition is usually unknown a priori. To bridge this gap, we present a new
offline RL framework that smoothly interpolates between the two extremes of
data composition, hence unifying imitation learning and vanilla offline RL. The
new framework is centered around a weak version of the concentrability
coefficient that measures the deviation of the behavior policy from the expert
policy alone.
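The abstract does not spell this coefficient out; as a hedged illustration only, a single-policy concentrability coefficient of the kind described, comparing the data (behavior) distribution against the occupancy measure of the expert/optimal policy alone, is commonly written as

    % Illustrative form (assumption, not quoted from the paper):
    % \mu is the behavior/data distribution, d^{\pi^\ast} the occupancy
    % measure of the expert (optimal) policy.
    C^\ast \;=\; \max_{s,a}\, \frac{d^{\pi^\ast}(s,a)}{\mu(s,a)},
    \qquad C^\ast \in [1,\infty).

Under this reading, $C^\ast = 1$ corresponds to purely expert data (the imitation-learning extreme), while large $C^\ast$ corresponds to data far from the expert policy (the vanilla offline RL extreme).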
Under this new framework, we further investigate the question of algorithm
design: can one develop an algorithm that achieves a minimax optimal rate and
also adapts to unknown data composition? To address this question, we consider
a lower confidence bound (LCB) algorithm built on the principle of pessimism in the
face of uncertainty in offline RL. We study finite-sample properties of LCB as
well as information-theoretic limits in multi-armed bandits, contextual
bandits, and Markov decision processes (MDPs). Our analysis reveals surprising
facts about optimality rates. In particular, in all three settings, LCB
achieves a faster rate of $1/N$ for nearly-expert datasets compared to the
usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in
the batch dataset. In the case of contextual bandits with at least two
contexts, we prove that LCB is adaptively optimal for the entire data
composition range, achieving a smooth transition from imitation learning to
offline RL. We further show that LCB is almost adaptively optimal in MDPs.
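As a minimal sketch of the LCB principle described above (a hedged illustration, not the paper's exact algorithm, bonus, or constants), the following Python snippet picks an arm from a fixed batch of bandit data by maximizing a lower confidence bound; arms that the behavior policy rarely sampled receive a large penalty and are effectively avoided:

    import numpy as np

    def lcb_arm_selection(rewards_by_arm, delta=0.05):
        """Choose an arm from offline bandit data via a lower confidence bound.

        rewards_by_arm: list of 1-D arrays of observed rewards (assumed in [0, 1]).
        delta: confidence level; the bonus below is a generic Hoeffding-style
               term, an assumption for illustration rather than the paper's choice.
        """
        num_arms = len(rewards_by_arm)
        lcb_values = []
        for rewards in rewards_by_arm:
            n = len(rewards)
            if n == 0:
                lcb_values.append(-np.inf)  # never-sampled arms are maximally penalized
                continue
            mean = rewards.mean()
            bonus = np.sqrt(np.log(2 * num_arms / delta) / (2 * n))
            lcb_values.append(mean - bonus)  # pessimism: subtract the uncertainty bonus
        return int(np.argmax(lcb_values))

    # Example: arm 0 is heavily covered (near-expert data), arm 1 barely covered.
    rng = np.random.default_rng(0)
    data = [rng.binomial(1, 0.6, size=90).astype(float),
            rng.binomial(1, 0.9, size=3).astype(float)]
    print(lcb_arm_selection(data))  # prefers the well-covered arm 0

The $1/N$ versus $1/\sqrt{N}$ distinction in the abstract concerns how quickly such a pessimistic rule commits to the behavior (near-expert) action as the dataset grows; the specific bonus form here is an assumption for illustration.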
Related papers
- Bridging Distributionally Robust Learning and Offline RL: An Approach to
Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale [27.02990488317357]
Given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs?
We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset together with information about the expert's behavioral policy used to generate it.
Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm, which can be seen as a combination of the RLSVI algorithm for online RL and imitation learning.
arXiv Detail & Related papers (2023-03-20T18:16:25Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? [86.43517734716606]
Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing previously collected experience, without any online interaction.
Behavioral cloning (BC) algorithms mimic a subset of the dataset via supervised learning.
We show that policies trained on sufficiently noisy suboptimal data can attain better performance than even BC algorithms with expert data.
arXiv Detail & Related papers (2022-04-12T08:25:34Z)
- Interpretable performance analysis towards offline reinforcement learning: A dataset perspective [6.526790418943535]
We propose a two-fold taxonomy for existing offline RL algorithms.
We explore the correlation between the performance of different types of algorithms and the distribution of actions under states.
We create a benchmark platform on the Atari domain, entitled easy go (RLEG), at an estimated cost of more than 0.3 million dollars.
arXiv Detail & Related papers (2021-05-12T07:17:06Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that limits divergence from the behavior policy and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function; a hedged sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
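The pessimistic value-iteration idea summarized under "Is Pessimism Provably Efficient for Offline RL?" above follows the same principle as the LCB sketch earlier. A minimal, hedged sketch (tabular, finite-horizon, with an assumed count-based penalty rather than that paper's exact uncertainty quantifier) might look like:

    import numpy as np

    def pessimistic_value_iteration(P_hat, r_hat, counts, H, beta=1.0):
        """Tabular finite-horizon value iteration with a pessimism penalty.

        P_hat:  (S, A, S) transition probabilities estimated from the batch.
        r_hat:  (S, A)    mean rewards estimated from the batch (assumed in [0, 1]).
        counts: (S, A)    visit counts in the batch; rarely visited pairs get a
                          large penalty (assumed beta / sqrt(n) bonus).
        Returns a greedy policy of shape (H, S).
        """
        S, A, _ = P_hat.shape
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        bonus = beta / np.sqrt(np.maximum(counts, 1))
        for h in reversed(range(H)):
            Q = r_hat + P_hat @ V - bonus      # penalize uncertain (s, a) pairs
            Q = np.clip(Q, 0.0, H - h)         # keep values in the feasible range
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy

As with the bandit sketch, the penalty shape and scaling here are placeholders; the cited works derive their own uncertainty quantifiers with provable guarantees.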