Black-box Off-policy Estimation for Infinite-Horizon Reinforcement
Learning
- URL: http://arxiv.org/abs/2003.11126v1
- Date: Tue, 24 Mar 2020 21:44:51 GMT
- Title: Black-box Off-policy Estimation for Infinite-Horizon Reinforcement
Learning
- Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou
- Abstract summary: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics.
We develop a new estimator that computes importance ratios of stationary distributions without knowledge of how the off-policy data are collected.
- Score: 26.880437279977155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy estimation for long-horizon problems is important in many
real-life applications such as healthcare and robotics, where high-fidelity
simulators may not be available and on-policy evaluation is expensive or
impossible. Recently, Liu et al. (2018) proposed an approach that avoids the
curse of horizon suffered by typical importance-sampling-based methods. While
showing promising results, this approach is limited in practice as it requires
data to be drawn from the stationary distribution of a known behavior policy.
In this work, we propose a novel approach that eliminates these limitations. In
particular, we formulate the problem as solving for the fixed point of a
certain operator. Using tools from Reproducing Kernel Hilbert Spaces (RKHSs),
we develop a new estimator that computes importance ratios of stationary
distributions without knowledge of how the off-policy data are collected. We
analyze its asymptotic consistency and finite-sample generalization.
Experiments on benchmarks verify the effectiveness of our approach.
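To make the abstract's formulation concrete: for the ratio w(s,a) = d_pi(s,a) / d_D(s,a), the stationary fixed-point condition can be tested against all functions f in an RKHS, i.e., E[w(s,a) f(s',a')] = E[w(s,a) f(s,a)] over transitions (s,a,s') in the data, with a' drawn from the target policy at s'. Maximizing the violation over the unit ball of a kernel turns this into a quadratic form in w. The sketch below is a minimal NumPy illustration of this family of estimators under assumed choices, not the authors' implementation; the Gaussian kernel, the per-sample weight parameterization, the projected-gradient solver, and the function names (gaussian_kernel, estimate_ratios, off_policy_value) are all hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of behavior-agnostic
# stationary-distribution ratio estimation with an RKHS-style kernel loss.
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Pairwise k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def estimate_ratios(SA, SA_next, iters=500, lr=0.05, bandwidth=1.0):
    """Estimate w_i ~ d_pi(s_i, a_i) / d_D(s_i, a_i) from off-policy data.

    SA[i]      -- feature vector of (s_i, a_i), drawn from the unknown
                  data distribution d_D (no behavior policy required).
    SA_next[i] -- feature vector of (s'_i, a'_i), where a'_i is sampled
                  from the *target* policy at s'_i.
    The condition E[w(s,a) f(s',a')] = E[w(s,a) f(s,a)] for all RKHS
    functions f yields the quadratic loss w^T Kbar w minimized below.
    """
    n = len(SA)
    Kbar = (gaussian_kernel(SA_next, SA_next, bandwidth)
            - gaussian_kernel(SA_next, SA, bandwidth)
            - gaussian_kernel(SA, SA_next, bandwidth)
            + gaussian_kernel(SA, SA, bandwidth)) / n ** 2
    w = np.ones(n)
    for _ in range(iters):
        w = w - lr * 2.0 * (Kbar @ w)  # gradient step on w^T Kbar w
        w = np.maximum(w, 1e-8)        # project onto w >= 0
        w *= n / w.sum()               # normalize so that mean(w) = 1
    return w

def off_policy_value(w, rewards):
    """Average reward under the target policy, reweighted by the ratios."""
    return float(w @ rewards / w.sum())
```

Given the estimated ratios, the off-policy value is the w-weighted average of observed rewards; the constraint mean(w) = 1 rules out the trivial minimizer w = 0. Note that no behavior-policy probabilities appear anywhere, which is the black-box property the abstract emphasizes.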
Related papers
- A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by sampling directly from the discounted kernel of the Markov process yields compelling statistical properties.
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
- Kernel Conditional Moment Constraints for Confounding Robust Inference [22.816690686310714]
We study policy evaluation of offline contextual bandits subject to unobserved confounders.
We propose a general estimator that provides a sharp lower bound of the policy value.
arXiv Detail & Related papers (2023-02-26T16:44:13Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks [71.95722100511627]
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks.
We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2022-06-06T20:25:20Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for the bias-constrained estimator (BCE) arises in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- High-Dimensional Robust Mean Estimation via Gradient Descent [73.61354272612752]
We show that the problem of robust mean estimation in the presence of a constant fraction of adversarial outliers can be solved by gradient descent.
Our work establishes an intriguing connection between non-convex optimization and algorithmic robust statistics.
arXiv Detail & Related papers (2020-05-04T10:48:04Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.