Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
- URL: http://arxiv.org/abs/2103.05741v1
- Date: Tue, 9 Mar 2021 22:31:20 GMT
- Title: Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
- Authors: Yihao Feng, Ziyang Tang, Na Zhang, Qiang Liu
- Abstract summary: Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
- Score: 21.520045697447372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) is the task of estimating the expected reward of
a given policy based on offline data previously collected under different
policies. Therefore, OPE is a key step in applying reinforcement learning to
real-world domains such as medical treatment, where interactive data collection
is expensive or even unsafe. As the observed data tends to be noisy and
limited, it is essential to provide rigorous uncertainty quantification, not
just a point estimation, when applying OPE to make high stakes decisions. This
work considers the problem of constructing non-asymptotic confidence intervals
in infinite-horizon off-policy evaluation, which remains a challenging open
question. We develop a practical algorithm through a primal-dual
optimization-based approach, which leverages the kernel Bellman loss (KBL) of
Feng et al. (2019) and a new martingale concentration inequality for the KBL
applicable to time-dependent data with unknown mixing conditions. Our algorithm
makes minimum assumptions on the data and the function class of the Q-function,
and works for the behavior-agnostic settings where the data is collected under
a mix of arbitrary unknown behavior policies. We present empirical results that
clearly demonstrate the advantages of our approach over existing methods.
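
To make the main ingredients concrete, here is a minimal, self-contained sketch (not the authors' implementation) of the empirical kernel Bellman loss (KBL) for a tabular Q-function, together with the high-probability feasible-set check that underlies the primal confidence bound. The RBF kernel and its bandwidth, the tabular MDP layout with synthetic transitions, and the concentration threshold `eps_n` are all illustrative assumptions; in the paper the threshold comes from the martingale concentration inequality.

```python
# Sketch only: empirical kernel Bellman loss and a feasible-set check,
# under assumed tabular data and an RBF kernel.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Offline transitions (s, a, r, s') collected under unknown behavior policies.
s  = rng.integers(n_states, size=200)
a  = rng.integers(n_actions, size=200)
r  = rng.normal(size=200)
sp = rng.integers(n_states, size=200)

pi = np.full((n_states, n_actions), 1.0 / n_actions)   # target policy pi(a|s)
Q  = rng.normal(size=(n_states, n_actions))            # candidate Q-function

def bellman_residual(Q):
    """R_Q(s, a, r, s') = Q(s, a) - r - gamma * E_{a'~pi(.|s')}[Q(s', a')]."""
    next_v = (pi[sp] * Q[sp]).sum(axis=1)
    return Q[s, a] - r - gamma * next_v

def empirical_kbl(Q, bandwidth=1.0):
    """V-statistic estimate (1/n^2) sum_{i,j} R_i K(x_i, x_j) R_j,
    where x_i = (s_i, a_i) is embedded as a 2-d feature under an RBF kernel."""
    x = np.stack([s, a], axis=1).astype(float)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    R = bellman_residual(Q)
    return float(R @ K @ R) / len(R) ** 2

# The paper's martingale concentration inequality supplies a threshold eps_n
# such that the true Q^pi satisfies empirical_kbl(Q^pi) <= eps_n with
# probability >= 1 - delta.  Any Q passing this check lies in the feasible set;
# the confidence interval on the policy value is then obtained by maximizing /
# minimizing E_{s0 ~ mu0, a0 ~ pi}[Q(s0, a0)] over that set (primal-dual optimization).
eps_n = 0.05  # placeholder value; not the bound derived in the paper
kbl = empirical_kbl(Q)
print("empirical KBL:", kbl, "feasible:", kbl <= eps_n)
```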
Related papers
- Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level).
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments.
Due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimation.
We propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE to an optimization problem.
arXiv Detail & Related papers (2020-08-15T07:24:38Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning [26.880437279977155]
Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics.
We develop a new estimator that computes importance ratios of stationary distributions without knowledge of how the off-policy data are collected.
arXiv Detail & Related papers (2020-03-24T21:44:51Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning applications such as education and healthcare.
We develop an approach that estimates the bounds of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)