On Sample-Efficient Offline Reinforcement Learning: Data Diversity,
Posterior Sampling, and Beyond
- URL: http://arxiv.org/abs/2401.03301v2
- Date: Tue, 6 Feb 2024 18:08:40 GMT
- Title: On Sample-Efficient Offline Reinforcement Learning: Data Diversity,
Posterior Sampling, and Beyond
- Authors: Thanh Nguyen-Tang and Raman Arora
- Abstract summary: We propose a notion of data diversity that subsumes the previous notions of coverage measures in offline RL.
Our proposed model-free PS-based algorithm for offline RL is novel, with sub-optimality bounds that are frequentist (i.e., worst-case) in nature.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We seek to understand what facilitates sample-efficient learning from
historical datasets for sequential decision-making, a problem that is popularly
known as offline reinforcement learning (RL). Further, we are interested in
algorithms that enjoy sample efficiency while leveraging (value) function
approximation. In this paper, we address these fundamental questions by (i)
proposing a notion of data diversity that subsumes the previous notions of
coverage measures in offline RL and (ii) using this notion to unify three
distinct classes of offline RL algorithms based on version spaces (VS),
regularized optimization (RO), and posterior sampling (PS). We establish that
VS-based, RO-based, and PS-based algorithms, under standard assumptions,
achieve comparable sample efficiency, recovering the state-of-the-art
sub-optimality bounds for finite and linear model classes. This result is
surprising, given that prior work
suggested an unfavorable sample complexity of the RO-based algorithm compared
to the VS-based algorithm, whereas posterior sampling is rarely considered in
offline RL due to its explorative nature. Notably, our proposed model-free
PS-based algorithm for offline RL is novel, with sub-optimality bounds that
are frequentist (i.e., worst-case) in nature.
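As a rough illustration of the posterior-sampling idea only (not the paper's algorithm), the Python sketch below forms a Gaussian posterior over the weights of a linear Q-function from an offline dataset, draws a single sample, and returns the greedy policy with respect to that draw. The linear-Gaussian model and all function names are illustrative assumptions; the paper's model-free PS algorithm and its data-diversity analysis are more general.

import numpy as np

def fit_gaussian_posterior(Phi, y, prior_var=1.0, noise_var=1.0):
    # Bayesian linear regression: posterior over linear Q-weights given
    # offline features Phi (n x d) and regression targets y (n,).
    d = Phi.shape[1]
    precision = np.eye(d) / prior_var + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y) / noise_var
    return mean, cov

def posterior_sample_policy(dataset, featurize, actions, rng):
    # Draw one Q-function from the data-conditioned posterior, then act
    # greedily with respect to that single draw (the posterior-sampling step).
    Phi = np.stack([featurize(s, a) for (s, a, r) in dataset])
    y = np.array([r for (_, _, r) in dataset])
    mean, cov = fit_gaussian_posterior(Phi, y)
    w = rng.multivariate_normal(mean, cov)
    def policy(state):
        q_values = [featurize(state, a) @ w for a in actions]
        return actions[int(np.argmax(q_values))]
    return policy

# Toy usage with synthetic logged data (purely illustrative).
rng = np.random.default_rng(0)
actions = [0, 1]
featurize = lambda s, a: np.array([s, float(a), s * float(a), 1.0])
dataset = [(rng.normal(), int(rng.integers(2)), rng.normal()) for _ in range(200)]
pi = posterior_sample_policy(dataset, featurize, actions, rng)
print(pi(0.3))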
Related papers
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods such as offline PPO and offline DPO: a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to
Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is the distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced
Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
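One hypothetical realization of such a plug-and-play sampling module (the exact weighting used in the paper may differ) is to sample trajectories with probability proportional to an exponential of their return rather than uniformly, so that rare high-return trajectories are replayed more often:

import numpy as np

def return_weighted_sampler(trajectory_returns, temperature=1.0, rng=None):
    # Sample trajectory indices with probability proportional to
    # exp(return / temperature) instead of uniformly, so high-return
    # trajectories are replayed more often in an imbalanced dataset.
    rng = rng or np.random.default_rng()
    returns = np.asarray(trajectory_returns, dtype=float)
    logits = (returns - returns.max()) / temperature  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    def sample(batch_size):
        return rng.choice(len(returns), size=batch_size, p=probs)
    return sample

# Usage: a dataset dominated by low-return trajectories.
sample = return_weighted_sampler([1.0, 2.0, 2.5, 95.0], temperature=10.0)
print(sample(batch_size=8))  # indices biased toward the high-return trajectory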
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
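A sketch of why Fenchel duality removes the double-sampling issue (notation illustrative; the paper's regularizer applies this to stationary-distribution-corrected returns and differs in detail): the variance of a random return $X$ under a distribution $\mu$ contains a squared expectation,

\[
\operatorname{Var}_{\mu}(X) \;=\; \mathbb{E}_{\mu}[X^2] \;-\; \big(\mathbb{E}_{\mu}[X]\big)^2,
\qquad
\big(\mathbb{E}_{\mu}[X]\big)^2 \;=\; \max_{\nu \in \mathbb{R}} \Big( 2\nu\,\mathbb{E}_{\mu}[X] \;-\; \nu^2 \Big),
\]

so the squared-expectation term, whose gradient would otherwise require two independent samples of $X$, is replaced by an inner optimization over a scalar dual variable $\nu$ that can be estimated from a single sample.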
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Revisiting the Linear-Programming Framework for Offline RL with General
Function Approximation [24.577243536475233]
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset.
Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators.
We revisit the linear-programming framework for offline RL, and advance the existing results in several aspects.
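For reference, the classical linear program in question, stated here in its exact (population) form for a discounted MDP with initial distribution $\rho$ and discount $\gamma$ (the paper's results concern sample-based versions with general function approximation):

\[
\max_{\mu \ge 0} \;\sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s,a) \;=\; (1-\gamma)\,\rho(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, \mu(s',a') \quad \forall s,
\]

whose optimizer $\mu^*$ is the normalized discounted occupancy measure of an optimal policy, recovered via $\pi^*(a \mid s) \propto \mu^*(s,a)$.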
arXiv Detail & Related papers (2022-12-28T15:28:12Z)
- Optimal Conservative Offline RL with General Function Approximation via
Augmented Lagrangian [18.2080757218886]
Offline reinforcement learning (RL) refers to decision-making from a previously collected dataset of interactions.
We present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2022-11-01T19:28:48Z)
- Distributionally Robust Offline Reinforcement Learning with Linear
Function Approximation [16.128778192359327]
We learn an RL agent with the historical data obtained from the source environment and optimize it to perform well in the perturbed one.
We prove that our algorithm achieves $O(1/\sqrt{K})$ suboptimality, with a dependence on the linear function dimension $d$.
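A generic form of the robust Bellman operator underlying this setup (the specific uncertainty set $\mathcal{U}(s,a)$ and its radius are assumptions of this illustration) is

\[
(\mathcal{T}^{\mathrm{rob}} V)(s,a) \;=\; r(s,a) \;+\; \gamma \inf_{\widetilde{P}(\cdot \mid s,a) \in \mathcal{U}(s,a)} \mathbb{E}_{s' \sim \widetilde{P}(\cdot \mid s,a)}\big[V(s')\big],
\]

where the worst case ranges over transition models $\widetilde{P}$ in an uncertainty set around the nominal kernel estimated from the source-environment data; the paper studies this setting with linear function approximation of dimension $d$.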
arXiv Detail & Related papers (2022-09-14T13:17:59Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
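The conservative-regularization idea can be caricatured as follows (a minimal sketch with hypothetical names; COMBO itself trains a dynamics model and applies the penalty to model-generated rollouts, which is omitted here):

import numpy as np

def conservative_critic_loss(q_fn, batch, ood_actions, gamma=0.99, beta=1.0):
    # batch holds logged arrays: states 's', actions 'a', rewards 'r', and a
    # bootstrap value 'v_next' for the next state (e.g. from a target network).
    q_data = q_fn(batch["s"], batch["a"])
    target = batch["r"] + gamma * batch["v_next"]
    bellman_error = np.mean((q_data - target) ** 2)
    # Conservative term: penalize large Q on out-of-support actions relative
    # to Q on the actions actually present in the dataset.
    penalty = np.mean(q_fn(batch["s"], ood_actions)) - np.mean(q_data)
    return bellman_error + beta * penalty

# Toy check with a linear Q and random arrays (illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(size=3)
q_fn = lambda s, a: np.c_[s, a, np.ones_like(s)] @ w
batch = {"s": rng.normal(size=32), "a": rng.normal(size=32),
         "r": rng.normal(size=32), "v_next": rng.normal(size=32)}
print(conservative_critic_loss(q_fn, batch, ood_actions=rng.normal(size=32)))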
arXiv Detail & Related papers (2021-02-16T18:50:32Z)