Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement
Learning
- URL: http://arxiv.org/abs/2001.10742v1
- Date: Wed, 29 Jan 2020 09:56:26 GMT
- Title: Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement
Learning
- Authors: Ming Yin and Yu-Xiang Wang (University of California Santa Barbara)
- Abstract summary: We consider the problem of off-policy evaluation for reinforcement learning.
The goal is to estimate the expected reward of a target policy using offline data collected by running a logging policy.
We show that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error.
- Score: 20.546806161935578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of off-policy evaluation for reinforcement learning,
where the goal is to estimate the expected reward of a target policy $\pi$
using offline data collected by running a logging policy $\mu$. Standard
importance-sampling based approaches for this problem suffer from a variance
that scales exponentially with time horizon $H$, which motivates a splurge of
recent interest in alternatives that break the "Curse of Horizon" (Liu et al.
2018, Xie et al. 2019). In particular, it was shown that a marginalized
importance sampling (MIS) approach can be used to achieve an estimation error
of order $O(H^3/ n)$ in mean square error (MSE) under an episodic Markov
Decision Process model with finite states and potentially infinite actions. The
MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound
of order $\Omega(H^2/n)$. In this paper, we prove that with a simple
modification to the MIS estimator, we can asymptotically attain the Cramer-Rao
lower bound, provided that the action space is finite. We also provide a
general method for constructing MIS estimators with high-probability error
bounds.
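For concreteness, the following is a minimal sketch of the marginalized importance sampling template the abstract describes: estimate the marginal state distribution of $\pi$ at each step by rolling an importance-weighted empirical transition model forward in time, then sum the reweighted per-state mean rewards. The function name, data layout, and array shapes are illustrative assumptions, and the sketch omits the modification and high-probability machinery developed in the paper; it is not the authors' exact estimator.

```python
import numpy as np

def mis_estimate(episodes, pi, mu, S, H):
    """Sketch of a marginalized importance sampling (MIS) estimator for
    tabular off-policy evaluation.  `episodes` is a list of trajectories
    [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})] logged under mu;
    `pi` and `mu` are (H, S, A) arrays of action probabilities.  This is
    an illustrative simplification, not the paper's exact estimator."""
    n = len(episodes)
    d0 = np.zeros(S)            # empirical initial-state distribution
    P_pi = np.zeros((H, S, S))  # reweighted transition estimate, P_pi[t, s_next, s]
    r_pi = np.zeros((H, S))     # reweighted mean reward at (t, s)
    counts = np.zeros((H, S))   # visit counts n_t(s)
    for traj in episodes:
        d0[traj[0][0]] += 1.0 / n
        for t, (s, a, r) in enumerate(traj):
            w = pi[t, s, a] / mu[t, s, a]     # single-step importance ratio
            counts[t, s] += 1.0
            r_pi[t, s] += w * r
            if t + 1 < H:
                P_pi[t, traj[t + 1][0], s] += w
    visited = counts > 0
    r_pi[visited] /= counts[visited]          # conditional mean of w * r given (t, s)
    for t in range(H - 1):
        for s in range(S):
            if counts[t, s] > 0:
                P_pi[t, :, s] /= counts[t, s]
    # Roll the estimated marginal state distribution of pi forward in time:
    # d_pi[t+1] = P_pi[t] @ d_pi[t], starting from the logged initial states.
    d_pi = np.zeros((H, S))
    d_pi[0] = d0
    for t in range(H - 1):
        d_pi[t + 1] = P_pi[t] @ d_pi[t]
    # Value estimate: sum over t of E_{s ~ d_pi[t]} of the reweighted reward.
    return float(np.sum(d_pi * r_pi))
```

The point of this construction is that importance ratios enter only through per-step conditional estimates rather than as a product over an entire trajectory, which is what avoids the variance that grows exponentially in $H$ for standard importance sampling.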
Related papers
- Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. Such a problem suffers from three main challenges: (i) the nonlinearity of the quantile objective as a functional of the reward distribution, (ii) the unobserved confounding issue, and (iii) insufficient coverage of the offline dataset.
arXiv Detail & Related papers (2025-06-08T13:37:38Z) - Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach [33.38582292895673]
We propose a Natural Actor-Critic with order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward Markov Decision Processes. NACB employs function approximation for both the actor and the critic, enabling scalability to large state and action spaces.
arXiv Detail & Related papers (2025-05-26T13:43:02Z) - Adversarially-Robust TD Learning with Markovian Data: Finite-Time Rates and Fundamental Limits [2.07180164747172]
Motivated by harsh, real-world environments, we revisit the policy evaluation problem from the perspective of adversarial robustness.
We develop a novel algorithm called Robust-TD and prove that its finite-time guarantees match those of vanilla TD with linear function approximation up to a small $O(\epsilon)$ term.
To our knowledge, these results are the first of their kind in the context of adversarial approximation schemes driven by Markov noise.
arXiv Detail & Related papers (2025-02-07T05:05:42Z) - A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP [1.0923877073891446]
We analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation.
We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging.
These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning (a minimal sketch of the underlying TD-with-LFA template appears after this list).
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also admits statistical inference via Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z) - Sample Complexity of Nonparametric Off-Policy Evaluation on
Low-Dimensional Manifolds using Deep Networks [71.95722100511627]
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks.
We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2022-06-06T20:25:20Z) - First-Order Regret in Reinforcement Learning with Linear Function
Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Black-box Off-policy Estimation for Infinite-Horizon Reinforcement
Learning [26.880437279977155]
Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics.
We develop a new estimator that computes importance ratios of stationary distributions without knowledge of how the off-policy data are collected.
arXiv Detail & Related papers (2020-03-24T21:44:51Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
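Several entries above (the finite-sample actor-critic analysis and the adversarially robust TD paper in particular) build on the TD(0) with linear function approximation template, combined with iterate averaging. Below is a minimal sketch of that plain template; the function name, feature map, and step size are illustrative assumptions, and it contains none of the mean-variance or robustness modifications those papers actually analyze.

```python
import numpy as np

def td0_lfa_tail_averaged(transitions, phi, dim, gamma=0.99, alpha=0.05, tail_frac=0.5):
    """TD(0) policy evaluation with linear function approximation and tail
    iterate averaging: only the last `tail_frac` fraction of the iterates
    is averaged to form the final parameter estimate."""
    theta = np.zeros(dim)
    iterates = []
    for s, r, s_next in transitions:              # on-policy transitions (s, r, s')
        f, f_next = phi(s), phi(s_next)           # feature vectors of length `dim`
        td_error = r + gamma * f_next.dot(theta) - f.dot(theta)
        theta = theta + alpha * td_error * f      # semi-gradient TD(0) step
        iterates.append(theta.copy())
    start = int((1.0 - tail_frac) * len(iterates))
    return np.mean(iterates[start:], axis=0)      # tail-averaged parameter
```

Tail averaging (as opposed to averaging every iterate) discards the early transient phase of the optimization, a common device for obtaining finite-sample bounds with a constant step size, including in the analysis summarized above.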