Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
- URL: http://arxiv.org/abs/2110.08695v1
- Date: Sun, 17 Oct 2021 01:21:52 GMT
- Title: Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
- Authors: Ming Yin and Yu-Xiang Wang
- Abstract summary: We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP).
In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}(V^\star_{h+1}+r_h)}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). \]
- Score: 34.54294677335518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the offline reinforcement learning (offline RL) problem, where the
goal is to learn a reward-maximizing policy in an unknown Markov Decision
Process (MDP) using the data coming from a policy $\mu$. In particular, we
consider the sample complexity problems of offline RL for finite-horizon MDPs.
Prior works study this problem based on different data-coverage assumptions,
and their learning guarantees are expressed by the covering coefficients which
lack the explicit characterization of system quantities. In this work, we
analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive
the suboptimality upper bound that nearly matches \[
O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).
\] As a complement, we also prove a per-instance information-theoretic lower
bound under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if
$d^{\pi^\star}_h(s_h,a_h)>0$. Different from the previous minimax lower bounds,
the per-instance lower bound (via local minimaxity) is a much stronger
criterion as it applies to individual instances separately. Here $\pi^\star$ is
an optimal policy, $\mu$ is the behavior policy, and $d_h^\mu$ is the marginal
state-action probability. We call the above equation the intrinsic offline
reinforcement learning bound since it directly implies all the existing optimal
results: minimax rate under uniform data-coverage assumption, horizon-free
setting, single policy concentrability, and the tight problem-dependent
results. Later, we extend the result to the assumption-free regime (where we
make no assumption on $\mu$) and obtain the assumption-free intrinsic bound.
Due to its generic form, we believe the intrinsic bound could help illuminate
what makes a specific problem hard and reveal the fundamental challenges in
offline RL.
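Since the abstract centers on a variance-weighted form of pessimistic value iteration and on the intrinsic bound displayed above, the following minimal Python sketch illustrates both ingredients on a toy tabular, finite-horizon MDP. This is not the authors' APVI implementation: the penalty constant, the synthetic "empirical" quantities, and every variable name are assumptions made purely for illustration.

```python
# Hedged sketch (not the paper's APVI): variance-penalized value iteration on a
# toy tabular MDP, plus a direct evaluation of the intrinsic-bound quantity.
# All inputs below are synthetic placeholders, not real offline data.
import numpy as np

H, S, A, n = 3, 4, 2, 1000          # horizon, states, actions, #episodes (toy sizes)
rng = np.random.default_rng(0)

n_counts = rng.integers(5, 50, size=(H, S, A))      # assumed visit counts n_h(s, a)
P_hat = rng.dirichlet(np.ones(S), size=(H, S, A))   # empirical transitions P_hat_h(s'|s, a)
reward = rng.uniform(0.0, 1.0, size=(H, S, A))      # rewards r_h(s, a)

def pessimistic_value_iteration(P_hat, reward, n_counts, c=1.0):
    """Backward induction with a Bernstein-style penalty
    ~ c * sqrt(Var_{P_hat}(V_{h+1}) / n_h(s, a)); the constant c is a placeholder."""
    H, S, A = reward.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        EV = P_hat[h] @ V[h + 1]                     # E[V_{h+1}] for each (s, a)
        var = P_hat[h] @ V[h + 1] ** 2 - EV ** 2     # Var[V_{h+1}] for each (s, a)
        Q = reward[h] + EV                           # empirical Bellman backup
        penalty = c * np.sqrt(np.maximum(var, 0.0) / np.maximum(n_counts[h], 1))
        Q_pess = np.clip(Q - penalty, 0.0, None)     # pessimistic Q estimate
        pi[h] = Q_pess.argmax(axis=1)                # greedy policy w.r.t. Q_pess
        V[h] = Q_pess.max(axis=1)
    return V, pi

def intrinsic_bound(d_pi_star, d_mu, var_next, n):
    """sum_h sum_{s,a} d^{pi*}_h(s,a) * sqrt(Var / d^mu_h(s,a)) / sqrt(n),
    evaluated on the support of d^{pi*} (assumes d^mu > 0 there)."""
    mask = d_pi_star > 0
    terms = d_pi_star[mask] * np.sqrt(var_next[mask] / d_mu[mask])
    return terms.sum() / np.sqrt(n)

V_pess, pi_hat = pessimistic_value_iteration(P_hat, reward, n_counts)
print("pessimistic initial values:", np.round(V_pess[0], 3))

# Exercise intrinsic_bound with synthetic occupancy measures (illustration only).
d_mu = n_counts / n_counts.sum(axis=(1, 2), keepdims=True)
d_pi_star = d_mu                       # pretend pi* visits exactly what mu visits
var_next = np.full_like(d_mu, 0.25)    # placeholder for Var(V*_{h+1} + r_h)
print("intrinsic bound (toy):", round(intrinsic_bound(d_pi_star, d_mu, var_next, n), 4))
```

The penalty mirrors the $\sqrt{\mathrm{Var}/n_h(s_h,a_h)}$ shape of the bound; in the paper the penalty constant and the variance estimator are chosen more carefully, so this should be read as a structural illustration only.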
Related papers
- Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback [56.6950165117658]
We consider offline reinforcement learning with preference feedback in which the implicit reward is a linear function of an unknown parameter.
We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), which yields an exponentially small simple regret.
We observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW.
arXiv Detail & Related papers (2024-06-18T02:03:12Z) - Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss [33.18537822803389]
We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, $\mathscr{F}$ is a weakly sub-Gaussian class.
Our result holds whether the problem is realizable or not, and we refer to this as a near mixing-free rate, since direct dependence on mixing is relegated to an additive higher-order term.
arXiv Detail & Related papers (2024-02-08T18:57:42Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning [59.02541753781001]
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy"
We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes (a toy plug-in of this bound is sketched after this list).
We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment.
arXiv Detail & Related papers (2021-06-09T08:28:55Z) - Characterizing Uniform Convergence in Offline Policy Evaluation via
model-based approach: Offline Learning, Task-Agnostic and Reward-Free [34.54294677335518]
We study the statistical limits of uniform convergence for offline policy evaluation problems (uniform OPE for short) with model-based methods under episodic MDP setting.
Our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for near-empirically optimal policies for MDPs with stationary transitions.
arXiv Detail & Related papers (2021-05-13T01:36:34Z)
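As a quick sanity check of the episode-complexity expression quoted in the policy finetuning entry above, the snippet below plugs arbitrary toy numbers into $H^3SC^\star/\varepsilon^2$; the constants and logarithmic factors hidden by $\widetilde{O}(\cdot)$ are dropped, and the values are illustrative, not taken from the paper.

```python
# Toy plug-in of the offline-reduction episode bound ~ H^3 * S * C_star / eps^2
# (log factors and constants omitted; the numbers are illustrative only).
H, S, C_star, eps = 20, 100, 4.0, 0.1
episodes = H**3 * S * C_star / eps**2
print(f"episodes ~ {episodes:.1e}")   # 3.2e+08 before constants and log factors
```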
This list is automatically generated from the titles and abstracts of the papers in this site.