Is Pessimism Provably Efficient for Offline RL?
- URL: http://arxiv.org/abs/2012.15085v1
- Date: Wed, 30 Dec 2020 09:06:57 GMT
- Title: Is Pessimism Provably Efficient for Offline RL?
- Authors: Ying Jin, Zhuoran Yang, Zhaoran Wang
- Abstract summary: We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
- Score: 104.00628430454479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from insufficient coverage of the dataset, which eludes most existing theoretical analyses. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as a penalty function. Such a penalty function simply flips the sign of the bonus function used to promote exploration in online RL, which makes it easily implementable and compatible with general function approximators.
Without assuming sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.
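In the linear-MDP case the abstract's "flipped bonus" has a familiar concrete form: least-squares value iteration run backward over the horizon, with an uncertainty term of the type beta * sqrt(phi(s,a)^T Lambda^{-1} phi(s,a)) subtracted from each estimated Q-value rather than added. The snippet below is a minimal sketch of that idea only; the dataset layout, the feature map `phi`, the penalty scale `beta`, and the helper names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def pevi_linear(dataset, phi, num_actions, H, beta, lam=1.0):
    """Sketch of pessimistic value iteration (PEVI) with linear function
    approximation: value iteration plus an uncertainty quantifier that is
    subtracted as a penalty (the sign-flipped exploration bonus).

    dataset:     list of trajectories, each a length-H list of
                 (state, action, reward, next_state) tuples; rewards in [0, 1].
    phi:         feature map phi(state, action) -> np.ndarray of dimension d.
    num_actions: size of the (finite) action set.
    """
    d = phi(dataset[0][0][0], dataset[0][0][1]).shape[0]
    w = [np.zeros(d) for _ in range(H)]          # linear Q-value weights per step
    Lambda_inv = [np.eye(d) for _ in range(H)]   # inverse regularized Gram matrices

    def q_hat(h, s, a):
        f = phi(s, a)
        penalty = beta * np.sqrt(f @ Lambda_inv[h] @ f)   # uncertainty quantifier
        # subtract the penalty and clip to the range of remaining returns
        return float(np.clip(f @ w[h] - penalty, 0.0, H - h))

    def v_hat(h, s):
        return max(q_hat(h, s, a) for a in range(num_actions))

    # backward induction from the last step to the first
    for h in reversed(range(H)):
        Lambda = lam * np.eye(d)
        target = np.zeros(d)
        for traj in dataset:
            s, a, r, s_next = traj[h]
            f = phi(s, a)
            Lambda += np.outer(f, f)
            y = r + (v_hat(h + 1, s_next) if h + 1 < H else 0.0)
            target += f * y
        Lambda_inv[h] = np.linalg.inv(Lambda)
        w[h] = Lambda_inv[h] @ target            # ridge regression on Bellman targets

    # greedy policy with respect to the pessimistic Q estimates
    def policy(h, s):
        return max(range(num_actions), key=lambda a: q_hat(h, s, a))

    return policy
```

The only change relative to an optimistic online method is the sign in front of the uncertainty term: subtracting it discourages the learned policy from relying on state-action pairs that the dataset covers poorly.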
Related papers
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Offline Reinforcement Learning with Additional Covering Distributions [0.0]
We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation.
We show that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weak realizable function classes.
arXiv Detail & Related papers (2023-05-22T03:31:03Z)
- Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning [23.222448307481073]
We propose a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage.
Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm.
We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T07:53:53Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.