On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation
- URL: http://arxiv.org/abs/2211.13208v1
- Date: Wed, 23 Nov 2022 18:50:44 GMT
- Title: On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation
- Authors: Thanh Nguyen-Tang, Ming Yin, Sunil Gupta, Svetha Venkatesh, Raman
Arora
- Abstract summary: We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
When, in addition, the linear features of the optimal actions in states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, BCP-VI achieves absolute zero sub-optimality once $K$ exceeds a finite instance-dependent threshold.
These are, respectively, the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and the first absolute zero sub-optimality bound for offline RL with linear function approximation from adaptive data.
- Score: 80.86358123230757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sample-efficient offline reinforcement learning (RL) with linear function
approximation has recently been studied extensively. Much of prior work has
yielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$,
with $K$ being the number of episodes in the offline data. In this work, we
seek to understand instance-dependent bounds for offline RL with function
approximation. We present an algorithm called Bootstrapped and Constrained
Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and
constrained optimization on top of pessimism. We show that under a partial data
coverage assumption, that of \emph{concentrability} with respect to an optimal
policy, the proposed algorithm yields a fast rate of
$\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap
in the optimal Q-value functions, even when the offline data were adaptively
collected. Moreover, when the linear features of the optimal actions in the
states reachable by an optimal policy span those reachable by the behavior
policy and the optimal actions are unique, offline RL achieves absolute zero
sub-optimality error when $K$ exceeds a (finite) instance-dependent threshold.
To the best of our knowledge, these are the first
$\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound
respectively for offline RL with linear function approximation from adaptive
data with partial coverage. We also provide instance-agnostic and
instance-dependent information-theoretical lower bounds to complement our upper
bounds.
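As a rough illustration of the pessimism principle underlying the abstract, the sketch below implements the generic pessimistic value-iteration backbone for linear function approximation (closest to the PEVI template listed among the related papers below): fit each stage's Q-function by ridge regression on the offline data, subtract an elliptical-bonus uncertainty penalty, and act greedily. This is a minimal sketch only; the bootstrapping and constrained-optimization components that distinguish BCP-VI are not reproduced, and the function names, dataset format, feature map, and constants are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of pessimistic value iteration with linear function
# approximation.  This is the generic backbone shared by pessimism-based
# offline-RL methods (e.g. PEVI, and the pessimistic part of BCP-VI); the
# bootstrapping and constrained-optimization steps specific to BCP-VI are
# not reproduced.  Dataset format, feature map, and constants are assumptions.
import numpy as np


def pessimistic_value_iteration(dataset, phi, actions, H, d, beta, lam=1.0):
    """Run backward pessimistic value iteration on an offline dataset.

    dataset: dict mapping step h (1..H) to a list of (s, a, r, s_next) tuples.
    phi:     feature map, phi(s, a) -> np.ndarray of shape (d,).
    Returns the per-step weight vectors and the greedy pessimistic policy.
    """
    weights = [None] * (H + 1)      # weights[h] parameterises \hat{Q}_h
    Lambda_inv = [None] * (H + 1)   # inverse regularised Gram matrices

    def q_hat(h, s, a):
        if h > H or weights[h] is None:
            return 0.0              # value is zero beyond the horizon
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv[h] @ f)  # uncertainty quantifier
        # Pessimism: subtract the bonus, then clip to the feasible value range.
        return float(np.clip(f @ weights[h] - bonus, 0.0, H - h + 1))

    def v_hat(h, s):
        return max(q_hat(h, s, a) for a in actions)

    for h in range(H, 0, -1):       # backward induction over the horizon
        feats = np.array([phi(s, a) for (s, a, _, _) in dataset[h]])
        targets = np.array([r + v_hat(h + 1, s_next)
                            for (_, _, r, s_next) in dataset[h]])
        Lambda = feats.T @ feats + lam * np.eye(d)
        Lambda_inv[h] = np.linalg.inv(Lambda)
        weights[h] = Lambda_inv[h] @ feats.T @ targets  # ridge-regression fit

    def policy(h, s):               # greedy w.r.t. the pessimistic Q-values
        return max(actions, key=lambda a: q_hat(h, s, a))

    return weights, policy
```

With tabular one-hot features this reduces to count-based pessimism. Under the positive Q-value gap assumed in the abstract, the bonus eventually falls below the gap at the states an optimal policy visits, which is the usual intuition behind fast $\tilde{\mathcal{O}}(\frac{1}{K})$-type rates.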
Related papers
- Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback [56.6950165117658]
We consider offline reinforcement learning with preference feedback in which the implicit reward is a linear function of an unknown parameter.
We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), which yields a simple regret of the form $\exp(\cdot)$.
We observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW.
arXiv Detail & Related papers (2024-06-18T02:03:12Z)
- Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning [53.97335841137496]
We propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation.
Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation.
arXiv Detail & Related papers (2023-10-02T17:42:01Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL (the classical occupancy-measure LP underlying this approach is sketched after this list).
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
- Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation [24.577243536475233]
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset.
Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators.
We revisit the linear-programming framework for offline RL, and advance the existing results in several aspects.
arXiv Detail & Related papers (2022-12-28T15:28:12Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy (a schematic of this style of objective is sketched after this list).
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Near-Optimal Offline Reinforcement Learning via Double Variance Reduction [36.027428493021716]
Off-Policy Double Variance Reduction (OPDVR) is a new variance-reduction-based algorithm for offline RL.
We show that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/(d_m\epsilon^2))$ episodes of offline data.
We also show that OPDVR achieves rate-optimal sample complexity under alternative settings.
arXiv Detail & Related papers (2021-02-02T20:47:35Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
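For the two linear-programming entries above (the primal-dual method for linear MDPs and the revisited LP framework), the starting point is the classical occupancy-measure LP for a discounted MDP. The display below is that standard textbook LP and its dual, not the specific relaxations used in either paper; here $d$ is an occupancy measure over state-action pairs, $\mu_0$ the initial-state distribution, $P$ the transition kernel, $r$ the reward, and $\gamma$ the discount factor:

$$
\begin{aligned}
\max_{d \ge 0}\ & \sum_{s,a} d(s,a)\, r(s,a) \\
\text{s.t.}\ & \sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\end{aligned}
$$

whose dual is the familiar value-function LP:

$$
\min_{V}\ (1-\gamma) \sum_{s} \mu_0(s)\, V(s) \quad \text{s.t.}\quad V(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \quad \forall (s,a).
$$

Primal-dual offline RL methods of this kind approximate $d$ and $V$ within linear (or more general) function classes and solve a saddle-point form of this LP from the offline data.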
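The OptiDICE entry, in turn, works with stationary distribution corrections $w(s,a) = d^{\pi}(s,a)/d^{D}(s,a)$, the ratio between a policy's discounted occupancy and that of the data distribution $d^{D}$. A schematic DICE-style regularized program (a hedged sketch, not the exact OptiDICE objective; $\alpha$ is a regularization weight and $D_f$ an $f$-divergence) is:

$$
\max_{w \ge 0}\ \mathbb{E}_{(s,a)\sim d^{D}}\!\left[ w(s,a)\, r(s,a) \right] \;-\; \alpha\, D_f\!\left( w \cdot d^{D} \,\|\, d^{D} \right)
\quad \text{s.t.}\quad \sum_{a} w(s,a)\, d^{D}(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, w(s',a')\, d^{D}(s',a') \quad \forall s.
$$

Optimizing over $w$ keeps the learned policy anchored to the data distribution, which is the sense in which this family of methods avoids the value overestimation mentioned in the entry.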