Related papers: Characterizing Uniform Convergence in Offline Policy Evaluation via model-based approach: Offline Learning, Task-Agnostic and Reward-Free

URL: http://arxiv.org/abs/2105.06029v1
Date: Thu, 13 May 2021 01:36:34 GMT
Title: Characterizing Uniform Convergence in Offline Policy Evaluation via model-based approach: Offline Learning, Task-Agnostic and Reward-Free
Authors: Ming Yin, Yu-Xiang Wang
Abstract summary: We study the statistical limits of uniform convergence for offline policy evaluation problems (uniform OPE for short) with model-based methods under episodic MDP setting. Our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for near-empirically optimal policies for the MDPs with stationary transition.
Score: 34.54294677335518
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the statistical limits of uniform convergence for offline policy evaluation (OPE) problems (uniform OPE for short) with model-based methods in the episodic MDP setting. Uniform OPE $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$ (initiated by Yin et al. 2021) is a stronger measure than point-wise (fixed policy) OPE and ensures offline policy learning when $\Pi$ contains all policies (we call this the global policy class). In this paper, we establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, where $d_m$ is the minimal state-action distribution induced by the behavior policy. The order $S/d_m\epsilon^2$ reveals that the global uniform OPE task is intrinsically harder than offline policy learning due to the extra $S$ factor. Next, our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for MDPs with \emph{stationary} transition. The result implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case. Most importantly, the model-based method combined with our new analysis technique (singleton absorbing MDP) can be adapted to new settings: the offline task-agnostic and the offline reward-free settings, with optimal complexity $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively, which provides a unified framework for simultaneously solving different offline RL problems.
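The global uniform OPE quantity in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' method or analysis: it assumes a tiny tabular MDP with stationary transitions, known rewards in [0, 1], a uniformly random behavior policy, and a uniform fallback for unseen state-action pairs. It measures the error on the initial-state value $V_1^\pi$ rather than on $Q^\pi$, and takes the supremum by brute-force enumeration of deterministic policies. All function names are hypothetical.

```python
import itertools
import random


def evaluate_policy(P, r, policy, H):
    """Backward induction: return V_1^pi(s) for every state s.

    P[s][a] is a list of next-state probabilities (stationary transition),
    r[s][a] is a reward in [0, 1], policy[h][s] is the action taken at step h.
    """
    S = len(P)
    V = [0.0] * S  # value after the final step is zero
    for h in reversed(range(H)):
        V = [
            r[s][policy[h][s]]
            + sum(p * V[s2] for s2, p in enumerate(P[s][policy[h][s]]))
            for s in range(S)
        ]
    return V


def sample_episodes(P, H, n, rng):
    """Collect n episodes of (s, a, s') tuples under a uniform behavior policy."""
    S, A = len(P), len(P[0])
    data = []
    for _ in range(n):
        s = rng.randrange(S)
        for _ in range(H):
            a = rng.randrange(A)
            s2 = rng.choices(range(S), weights=P[s][a])[0]
            data.append((s, a, s2))
            s = s2
    return data


def estimate_model(data, S, A):
    """Empirical transition model from counts (the model-based plug-in step)."""
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    for s, a, s2 in data:
        counts[s][a][s2] += 1
    P_hat = []
    for s in range(S):
        row = []
        for a in range(A):
            n = sum(counts[s][a])
            # Assumption: unseen pairs fall back to a uniform next-state distribution.
            row.append([c / n for c in counts[s][a]] if n else [1.0 / S] * S)
        P_hat.append(row)
    return P_hat


def uniform_ope_error(P, P_hat, r, H):
    """Max over all deterministic policies and states of |V_1^pi - V_hat_1^pi|."""
    S, A = len(P), len(P[0])
    worst = 0.0
    for flat in itertools.product(range(A), repeat=S * H):
        policy = [flat[h * S:(h + 1) * S] for h in range(H)]
        v = evaluate_policy(P, r, policy, H)
        v_hat = evaluate_policy(P_hat, r, policy, H)
        worst = max(worst, max(abs(x - y) for x, y in zip(v, v_hat)))
    return worst


# Toy instance: 2 states, 2 actions, horizon 3 (values chosen only for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
r = [[0.0, 1.0], [1.0, 0.0]]
H = 3
rng = random.Random(0)
P_hat = estimate_model(sample_episodes(P, H, 2000, rng), 2, 2)
print(uniform_ope_error(P, P_hat, r, H))
```

Enumeration costs $A^{SH}$ policy evaluations, so this is only feasible for toy sizes; it is meant to make the "sup over the policy class" concrete, not to be an efficient estimator.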

Related papers

Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs [35.22742439337603]
The proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
arXiv Detail & Related papers (2024-08-21T10:44:57Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}(SA\frac{H}{\varepsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z)
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP). We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs). Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism [34.54294677335518]
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP). In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[ O\left(\sum_{h=1}^{H}\sum_{s_h,a_h} d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{\ldots}}{\ldots}}\right) \]
arXiv Detail & Related papers (2021-10-17T01:21:52Z)
Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We show that to obtain an $\epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z)
Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning [59.02541753781001]
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy". We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3 S C^\star/\varepsilon^2)$ episodes. We then establish an $\Omega(H^3 S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the
arXiv Detail & Related papers (2021-06-09T08:28:55Z)
Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$. We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)
Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment at most $O\left( \frac{S^2A}{\epsilon^2}\,\text{polylog}\left(\frac{SAH}{\epsilon^2}\ldots\right)\right)$
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning [43.61029925616256]
Offline policy evaluation in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE. Our results imply that the model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-07-07T19:44:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.