Characterizing Uniform Convergence in Offline Policy Evaluation via
model-based approach: Offline Learning, Task-Agnostic and Reward-Free
- URL: http://arxiv.org/abs/2105.06029v1
- Date: Thu, 13 May 2021 01:36:34 GMT
- Title: Characterizing Uniform Convergence in Offline Policy Evaluation via
model-based approach: Offline Learning, Task-Agnostic and Reward-Free
- Authors: Ming Yin, Yu-Xiang Wang
- Abstract summary: We study the statistical limits of uniform convergence for offline policy evaluation problems (uniform OPE for short) with model-based methods under episodic MDP setting.
Our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{near-empirically optimal} policies for the MDPs with \emph{stationary} transition.
- Score: 34.54294677335518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the statistical limits of uniform convergence for offline policy
evaluation (OPE) problems (uniform OPE for short) with model-based methods
under episodic MDP setting. Uniform OPE $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$
(initiated by Yin et al. 2021) is a stronger measure than the point-wise (fixed
policy) OPE and ensures offline policy learning when $\Pi$ contains all
policies (we call it global policy class). In this paper, we establish an
$\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over model-based family) for the
global uniform OPE, where $d_m$ is the minimal state-action distribution
induced by the behavior policy. The order $S/d_m\epsilon^2$ reveals that the global
uniform OPE task is intrinsically harder than offline policy learning due to
the extra $S$ factor. Next, our main result establishes an episode complexity
of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{local} uniform convergence that
applies to all \emph{near-empirically optimal} policies for the MDPs with
\emph{stationary} transition. The result implies the optimal sample complexity
for offline learning and separates local uniform OPE from the global case.
Importantly, the model-based method, combined with our new analysis technique
(the singleton absorbing MDP), can be adapted to new settings: offline
task-agnostic and offline reward-free learning, with optimal complexities
$\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and
$\tilde{O}(H^2S/d_m\epsilon^2)$ respectively. This provides a unified
framework for simultaneously solving different offline RL problems.
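To make the model-based (plug-in) estimator behind these bounds concrete, below is a minimal sketch of tabular offline policy evaluation on an empirical MDP. It is not the authors' code: the toy environment, the uniform behavior policy, and all function names are illustrative assumptions. The recipe is simply to count transitions and rewards from the logged episodes to form $(\hat{P}, \hat{r})$ and then compute $\hat{Q}^\pi$ for any target policy by backward induction; uniform OPE asks how far $\hat{Q}^\pi$ can deviate from $Q^\pi$ simultaneously over a whole policy class $\Pi$.

```python
"""Minimal sketch (illustrative, not the paper's code) of model-based OPE:
build the empirical MDP from logged episodes, then evaluate any target policy
on it by backward induction to obtain the plug-in estimate Q_hat^pi."""
import numpy as np


def estimate_empirical_mdp(episodes, S, A, H):
    """Count-based estimates of the transition kernel and mean reward per step h."""
    counts = np.zeros((H, S, A))
    trans = np.zeros((H, S, A, S))
    rew = np.zeros((H, S, A))
    for ep in episodes:                              # ep is a list of (s, a, r, s_next)
        for h, (s, a, r, s_next) in enumerate(ep):
            counts[h, s, a] += 1
            trans[h, s, a, s_next] += 1
            rew[h, s, a] += r
    visited = counts > 0                             # unvisited (h, s, a) keep zero estimates
    P_hat = np.zeros_like(trans)
    r_hat = np.zeros_like(rew)
    P_hat[visited] = trans[visited] / counts[visited][:, None]
    r_hat[visited] = rew[visited] / counts[visited]
    return P_hat, r_hat


def evaluate_policy(P_hat, r_hat, pi):
    """Backward induction on the empirical MDP; returns Q_hat^pi of shape (H, S, A)."""
    H, S, A, _ = P_hat.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q[h] = r_hat[h] + P_hat[h] @ V[h + 1]        # (S, A, S) @ (S,) -> (S, A)
        V[h] = np.einsum("sa,sa->s", pi[h], Q[h])    # V_h(s) = sum_a pi_h(a|s) Q_h(s, a)
    return Q[:H]


# Toy usage: log K episodes with a uniform behavior policy, then evaluate a target policy.
rng = np.random.default_rng(0)
S, A, H, K = 4, 2, 3, 500
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))   # true (unknown) transitions
r_true = rng.uniform(size=(H, S, A))                 # true (unknown) mean rewards
behavior = np.full((H, S, A), 1.0 / A)
episodes = []
for _ in range(K):
    s, ep = rng.integers(S), []
    for h in range(H):
        a = rng.choice(A, p=behavior[h, s])
        s_next = rng.choice(S, p=P_true[h, s, a])
        ep.append((s, a, r_true[h, s, a], s_next))
        s = s_next
    episodes.append(ep)

P_hat, r_hat = estimate_empirical_mdp(episodes, S, A, H)
target = rng.dirichlet(np.ones(A), size=(H, S))      # an arbitrary target policy pi_h(a|s)
Q_hat = evaluate_policy(P_hat, r_hat, target)
print("plug-in Q_hat^pi at h=0:\n", Q_hat[0])
```

Unvisited state-action pairs simply keep zero estimates in this sketch; handling them carefully (for instance via the singleton absorbing MDP construction mentioned above) is exactly where the paper's sharper analysis comes in.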
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Towards Instance-Optimal Offline Reinforcement Learning with Pessimism [34.54294677335518]
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP)
In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches $O\left(\sum_{h=1}^{H}\sum_{s_h,a_h} d^{\pi^\star}_h(s_h,a_h)\sqrt{\mathrm{Var}_{\cdots}}\right)$
arXiv Detail & Related papers (2021-10-17T01:21:52Z) - Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning [59.02541753781001]
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy"
We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes.
We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the
arXiv Detail & Related papers (2021-06-09T08:28:55Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation
for Reinforcement Learning [43.61029925616256]
Offline policy evaluation in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z)
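As a short aside on why uniform OPE "ensures offline policy learning" (claimed in the main abstract and in the last related paper above), the standard decomposition below is the underlying argument; the notation $v^\pi$ for the true value and $\hat{v}^\pi$ for its empirical-MDP estimate is ours, introduced only for this illustration.

```latex
% If sup_{pi in Pi} |v^pi - \hat{v}^pi| <= epsilon and Pi contains both pi^* and the
% empirical maximizer \hat{pi} = argmax_pi \hat{v}^pi, then \hat{pi} is 2*epsilon-optimal:
\begin{align*}
v^{\pi^\star} - v^{\hat{\pi}}
  = \underbrace{\bigl(v^{\pi^\star} - \hat{v}^{\pi^\star}\bigr)}_{\le\,\epsilon}
  + \underbrace{\bigl(\hat{v}^{\pi^\star} - \hat{v}^{\hat{\pi}}\bigr)}_{\le\,0
      \text{ (optimality of } \hat{\pi} \text{ in the empirical MDP)}}
  + \underbrace{\bigl(\hat{v}^{\hat{\pi}} - v^{\hat{\pi}}\bigr)}_{\le\,\epsilon}
  \;\le\; 2\epsilon .
\end{align*}
```

In particular, the deviation only needs to be controlled at $\hat{\pi}$ (plus a point-wise bound for the fixed $\pi^\star$), which is roughly why the local guarantee over near-empirically optimal policies in the main paper already yields the optimal offline-learning rate.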
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.