Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement
Learning
- URL: http://arxiv.org/abs/2001.10742v1
- Date: Wed, 29 Jan 2020 09:56:26 GMT
- Title: Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement
Learning
- Authors: Ming Yin and Yu-Xiang Wang (University of California Santa Barbara)
- Abstract summary: We consider the problem of off-policy evaluation for reinforcement learning.
The goal is to estimate the expected reward of a target policy using offline data collected by running a logging policy.
We show that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/n)$ in mean square error.
- Score: 20.546806161935578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of off-policy evaluation for reinforcement learning,
where the goal is to estimate the expected reward of a target policy $\pi$
using offline data collected by running a logging policy $\mu$. Standard
importance-sampling based approaches for this problem suffer from a variance
that scales exponentially with time horizon $H$, which motivates a splurge of
recent interest in alternatives that break the "Curse of Horizon" (Liu et al.
2018, Xie et al. 2019). In particular, it was shown that a marginalized
importance sampling (MIS) approach can be used to achieve an estimation error
of order $O(H^3/ n)$ in mean square error (MSE) under an episodic Markov
Decision Process model with finite states and potentially infinite actions. The
MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound
of order $\Omega(H^2/n)$. In this paper, we prove that with a simple
modification to the MIS estimator, we can asymptotically attain the Cramer-Rao
lower bound, provided that the action space is finite. We also provide a
general method for constructing MIS estimators with high-probability error
bounds.
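For concreteness, the following is a minimal sketch of the marginalized importance sampling template the abstract describes: estimate the marginal state distribution of $\pi$ at each step by rolling an importance-weighted empirical transition model forward in time, then sum the reweighted per-state mean rewards. The function name, data layout, and array shapes are illustrative assumptions, and the sketch omits the modification and high-probability machinery developed in the paper; it is not the authors' exact estimator.

```python
import numpy as np

def mis_estimate(episodes, pi, mu, S, H):
    """Sketch of a marginalized importance sampling (MIS) estimator for
    tabular off-policy evaluation.  `episodes` is a list of trajectories
    [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})] logged under mu;
    `pi` and `mu` are (H, S, A) arrays of action probabilities.  This is
    an illustrative simplification, not the paper's exact estimator."""
    n = len(episodes)
    d0 = np.zeros(S)            # empirical initial-state distribution
    P_pi = np.zeros((H, S, S))  # reweighted transition estimate, P_pi[t, s_next, s]
    r_pi = np.zeros((H, S))     # reweighted mean reward at (t, s)
    counts = np.zeros((H, S))   # visit counts n_t(s)
    for traj in episodes:
        d0[traj[0][0]] += 1.0 / n
        for t, (s, a, r) in enumerate(traj):
            w = pi[t, s, a] / mu[t, s, a]     # single-step importance ratio
            counts[t, s] += 1.0
            r_pi[t, s] += w * r
            if t + 1 < H:
                P_pi[t, traj[t + 1][0], s] += w
    visited = counts > 0
    r_pi[visited] /= counts[visited]          # conditional mean of w * r given (t, s)
    for t in range(H - 1):
        for s in range(S):
            if counts[t, s] > 0:
                P_pi[t, :, s] /= counts[t, s]
    # Roll the estimated marginal state distribution of pi forward in time:
    # d_pi[t+1] = P_pi[t] @ d_pi[t], starting from the logged initial states.
    d_pi = np.zeros((H, S))
    d_pi[0] = d0
    for t in range(H - 1):
        d_pi[t + 1] = P_pi[t] @ d_pi[t]
    # Value estimate: sum over t of E_{s ~ d_pi[t]} of the reweighted reward.
    return float(np.sum(d_pi * r_pi))
```

The point of this construction is that importance ratios enter only through per-step conditional estimates rather than as a product over an entire trajectory, which is what avoids the variance that grows exponentially in $H$ for standard importance sampling.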
Related papers
- Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. Such a problem suffers from three main challenges: (i) the nonlinearity of the quantile objective as a functional of the reward distribution, (ii) the unobserved confounding issue, and (iii) insufficient coverage of the offline dataset.
arXiv Detail & Related papers (2025-06-08T13:37:38Z) - Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach [33.38582292895673]
We propose a Natural Actor-Critic with order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward Markov Decision Processes. NACB employs function approximation for both the actor and the critic, enabling scalability to large state and action spaces.
arXiv Detail & Related papers (2025-05-26T13:43:02Z) - Adversarially-Robust TD Learning with Markovian Data: Finite-Time Rates and Fundamental Limits [2.07180164747172]
Motivated by harsh, real-world environments, we revisit the policy evaluation problem from the perspective of adversarial robustness.
We develop a novel algorithm called Robust-TD and prove that its finite-time guarantees match those of vanilla TD with linear function approximation up to a small $O(\epsilon)$ term.
To our knowledge, these results are the first of their kind in the context of adversarial approximation schemes driven by Markov noise.
arXiv Detail & Related papers (2025-02-07T05:05:42Z) - A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP [1.0923877073891446]
We analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation.
We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging.
These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning (a minimal sketch of the underlying TD-with-LFA template appears after this list).
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also admits statistical inference via Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z) - Sample Complexity of Nonparametric Off-Policy Evaluation on
Low-Dimensional Manifolds using Deep Networks [71.95722100511627]
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks.
We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2022-06-06T20:25:20Z) - First-Order Regret in Reinforcement Learning with Linear Function
Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Black-box Off-policy Estimation for Infinite-Horizon Reinforcement
Learning [26.880437279977155]
Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics.
We develop a new estimator that computes importance ratios of stationary distributions without knowledge of how the off-policy data are collected.
arXiv Detail & Related papers (2020-03-24T21:44:51Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
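Several entries above (the finite-sample actor-critic analysis and the adversarially robust TD paper in particular) build on the TD(0) with linear function approximation template, combined with iterate averaging. Below is a minimal sketch of that plain template; the function name, feature map, and step size are illustrative assumptions, and it contains none of the mean-variance or robustness modifications those papers actually analyze.

```python
import numpy as np

def td0_lfa_tail_averaged(transitions, phi, dim, gamma=0.99, alpha=0.05, tail_frac=0.5):
    """TD(0) policy evaluation with linear function approximation and tail
    iterate averaging: only the last `tail_frac` fraction of the iterates
    is averaged to form the final parameter estimate."""
    theta = np.zeros(dim)
    iterates = []
    for s, r, s_next in transitions:              # on-policy transitions (s, r, s')
        f, f_next = phi(s), phi(s_next)           # feature vectors of length `dim`
        td_error = r + gamma * f_next.dot(theta) - f.dot(theta)
        theta = theta + alpha * td_error * f      # semi-gradient TD(0) step
        iterates.append(theta.copy())
    start = int((1.0 - tail_frac) * len(iterates))
    return np.mean(iterates[start:], axis=0)      # tail-averaged parameter
```

Tail averaging (as opposed to averaging every iterate) discards the early transient phase of the optimization, a common device for obtaining finite-sample bounds with a constant step size, including in the analysis summarized above.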