Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation
for Reinforcement Learning
- URL: http://arxiv.org/abs/2007.03760v2
- Date: Tue, 1 Dec 2020 09:14:25 GMT
- Title: Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation
for Reinforcement Learning
- Authors: Ming Yin, Yu Bai and Yu-Xiang Wang
- Abstract summary: Offline policy evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications.
We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE.
Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy.
- Score: 43.61029925616256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL)
is a critical step towards applying RL in real-life applications. Existing work
on OPE mostly focuses on evaluating a fixed target policy $\pi$, which does not
provide useful bounds for offline policy learning as $\pi$ will then be
data-dependent. We address this problem by simultaneously evaluating all
policies in a policy class $\Pi$ -- uniform convergence in OPE -- and obtain
nearly optimal error bounds for a number of global / local policy classes. Our
results imply that model-based planning achieves an optimal episode
complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an
$\epsilon$-optimal policy under the time-inhomogeneous episodic MDP model ($H$
is the planning horizon, $d_m$ is a quantity that reflects the exploration of
the logging policy $\mu$). To the best of our knowledge, this is the first time
the optimal rate is shown to be achievable in the offline RL setting, and this
paper is the first to systematically investigate uniform convergence in OPE.
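As a hedged illustration rather than a verbatim statement of the paper's theorem (constants, logarithmic factors, and the precise definition of $d_m$ are suppressed, and the notation $n$ for the number of logged episodes and $\hat{v}^\pi$ for the offline value estimate is introduced here), the uniform-convergence guarantee and the episode complexity it implies can be sketched as
$$
\sup_{\pi \in \Pi} \bigl| \hat{v}^{\pi} - v^{\pi} \bigr| \;\lesssim\; \sqrt{\frac{H^{3}}{d_m\, n}}
\qquad\Longrightarrow\qquad
n = \widetilde{O}\!\left(\frac{H^{3}}{d_m \epsilon^{2}}\right) \text{ episodes suffice for an } \epsilon\text{-optimal policy,}
$$
where the implication uses the standard argument that a uniform evaluation error of $\epsilon/2$ over $\Pi$ makes the empirically optimal policy $\epsilon$-optimal.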
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Importance Weighted Actor-Critic for Optimal Conservative Offline
Reinforcement Learning [23.222448307481073]
We propose a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage.
Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm.
We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T07:53:53Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Policy Finetuning: Bridging Sample-Efficient Offline and Online
Reinforcement Learning [59.02541753781001]
This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy"
We first design a sharp offline reduction algorithm that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes.
We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the
arXiv Detail & Related papers (2021-06-09T08:28:55Z) - Characterizing Uniform Convergence in Offline Policy Evaluation via
model-based approach: Offline Learning, Task-Agnostic and Reward-Free [34.54294677335518]
We study the statistical limits of uniform convergence for offline policy evaluation problems (uniform OPE for short) with model-based methods under episodic MDP setting.
Our main result establishes an episode complexity of $\tilde{O}(H^2/d_m\epsilon^2)$ for near-empirically optimal policies for MDPs with stationary transitions.
arXiv Detail & Related papers (2021-05-13T01:36:34Z) - POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value function perspective.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-12-26T06:24:34Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
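For reference, several of the constrained-RL entries above (the $q_\pi$-realizable CMDP, COptiDICE, and CRPO papers) are built around the standard constrained-MDP objective of maximizing expected return subject to a cost budget. A generic sketch, not the exact formulation of any single paper and with the symbols $r_h$, $c_h$, and $\tau$ introduced here for illustration, is
$$
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} r_h\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} c_h\right] \le \tau,
$$
where $r_h$ and $c_h$ are the reward and cost incurred at step $h$ and $\tau$ is the cost budget; COptiDICE, as summarized above, optimizes an equivalent objective in the space of stationary distributions.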
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.