Off-Policy Confidence Interval Estimation with Confounded Markov
Decision Process
- URL: http://arxiv.org/abs/2202.10589v1
- Date: Tue, 22 Feb 2022 00:03:48 GMT
- Title: Off-Policy Confidence Interval Estimation with Confounded Markov
Decision Process
- Authors: Chengchun Shi, Jin Zhu, Ye Shen, Shikai Luo, Hongtu Zhu and Rui Song
- Abstract summary: We show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process.
Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies.
- Score: 14.828039846764549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is concerned with constructing a confidence interval for a target
policy's value offline, based on pre-collected observational data in infinite
horizon settings. Most existing works assume that no unmeasured variables
confound the observed actions. This assumption, however, is likely
to be violated in real applications such as healthcare and technological
industries. In this paper, we show that with some auxiliary variables that
mediate the effect of actions on the system dynamics, the target policy's value
is identifiable in a confounded Markov decision process. Based on this result,
we develop an efficient off-policy value estimator that is robust to potential
model misspecification and provide rigorous uncertainty quantification. Our
method is justified by theoretical results and by simulated and real datasets
obtained from ridesharing companies.
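The paper's efficient estimator and its asymptotic theory are developed in the full text. Purely as a generic illustration of how per-trajectory value estimates can be turned into an off-policy confidence interval, here is a minimal Wald-type sketch; all names are hypothetical and this is not the authors' influence-function-based estimator:

```python
import numpy as np
from scipy import stats

def wald_ci(pseudo_values, alpha=0.05):
    """Wald-type (1 - alpha) confidence interval from i.i.d. per-trajectory
    pseudo-values of the target policy's value (generic sketch only)."""
    pseudo_values = np.asarray(pseudo_values, dtype=float)
    n = len(pseudo_values)
    point = pseudo_values.mean()
    se = pseudo_values.std(ddof=1) / np.sqrt(n)
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return point - z * se, point + z * se

# Example with synthetic pseudo-values.
rng = np.random.default_rng(0)
print(wald_ci(rng.normal(loc=1.2, scale=0.5, size=500)))
```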
Related papers
- Distributional Shift-Aware Off-Policy Interval Estimation: A Unified
Error Quantification Framework [8.572441599469597]
We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes.
The objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies.
We show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings.
arXiv Detail & Related papers (2023-09-23T06:35:44Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
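As background for the conformal entry above, a minimal split-conformal sketch under the vanilla exchangeability assumption; the paper's contribution is precisely to handle the shift between behavior and target policies, which this sketch ignores (names hypothetical):

```python
import numpy as np

def split_conformal_interval(cal_returns, cal_predictions, new_prediction, alpha=0.1):
    """Split-conformal interval around a point prediction of the return.
    Assumes the calibration returns are exchangeable with the new one,
    which plain off-policy data generally violates."""
    scores = np.abs(np.asarray(cal_returns) - np.asarray(cal_predictions))
    n = len(scores)
    # Finite-sample-corrected quantile of the nonconformity scores.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return new_prediction - q, new_prediction + q

rng = np.random.default_rng(1)
returns = rng.normal(1.0, 0.3, size=200)
preds = returns + rng.normal(0.0, 0.1, size=200)   # imperfect model predictions
print(split_conformal_interval(returns, preds, new_prediction=1.05))
```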
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
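As background for the propensity-score entry above, a minimal plain inverse propensity score sketch for logged bandit-style data; UIPS additionally reweights by the uncertainty of the estimated propensities, which is not shown here (names hypothetical):

```python
import numpy as np

def ips_value(rewards, target_probs, behavior_probs, clip=10.0):
    """Plain inverse propensity score estimate of a target policy's value
    from logged (context, action, reward) data.  `target_probs` and
    `behavior_probs` are the probabilities each policy assigns to the
    logged action; clipping is a common variance-control heuristic."""
    w = np.minimum(np.asarray(target_probs) / np.asarray(behavior_probs), clip)
    return float(np.mean(w * np.asarray(rewards)))

rng = np.random.default_rng(2)
behavior = rng.uniform(0.2, 0.8, size=1000)
target = rng.uniform(0.2, 0.8, size=1000)
rewards = rng.binomial(1, 0.4, size=1000)
print(ips_value(rewards, target, behavior))
```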
- An Instrumental Variable Approach to Confounded Off-Policy Evaluation [11.785128674216903]
Off-policy evaluation (OPE) is a method for estimating the return of a target policy.
This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes.
arXiv Detail & Related papers (2022-12-29T22:06:51Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Statistical Bootstrapping for Uncertainty Estimation in Off-Policy
Evaluation [38.31971190670345]
We investigate whether statistical bootstrapping can be used to produce calibrated confidence intervals for the true value of the policy.
We show that it can yield accurate confidence intervals in a variety of conditions, including challenging continuous control environments and small data regimes.
arXiv Detail & Related papers (2020-07-27T14:49:22Z)
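A minimal percentile-bootstrap sketch of the idea in the entry above: resample whole trajectories and recompute the value estimate to form an interval; the paper studies when such intervals are actually calibrated (names hypothetical):

```python
import numpy as np

def bootstrap_ci(trajectory_returns, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean return, resampling whole
    trajectories so within-trajectory dependence is preserved."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(trajectory_returns, dtype=float)
    n = len(returns)
    boot_means = [rng.choice(returns, size=n, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

rng = np.random.default_rng(3)
print(bootstrap_ci(rng.normal(1.0, 0.4, size=300)))
```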
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
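To illustrate why kernelization is needed in the entry above: a deterministic continuous-action target policy places a point mass at pi(s), so the usual density ratio is undefined. One common fix is a kernel-smoothed importance weight. A minimal sketch of that idea only, not the authors' doubly robust construction (bandwidth and names hypothetical):

```python
import numpy as np

def kernel_is_value(states_actions_rewards, target_action_fn, behavior_density_fn, h=0.1):
    """Kernel-smoothed importance sampling value estimate for a deterministic
    target policy pi(s) with continuous actions.  The Dirac mass the target
    policy places at pi(s) is smoothed with a Gaussian kernel of bandwidth h."""
    total, n = 0.0, 0
    for s, a, r in states_actions_rewards:
        k = np.exp(-0.5 * ((a - target_action_fn(s)) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        total += (k / behavior_density_fn(a, s)) * r
        n += 1
    return total / n

# Toy example: behavior actions ~ N(0, 1), target policy always plays 0.0,
# reward r = 1 + 0.5 * a, so the true target value is 1.0.
rng = np.random.default_rng(4)
data = [(None, a, 1.0 + 0.5 * a) for a in rng.normal(0.0, 1.0, size=5000)]
norm_pdf = lambda a, s: np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)
print(kernel_is_value(data, lambda s: 0.0, norm_pdf, h=0.2))
```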
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
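Once a stationary-distribution correction ratio of the kind described in the GenDICE entry has been estimated, the policy value can be read off as a weighted average of logged rewards. A minimal sketch assuming the ratio weights are already available; GenDICE's minimax estimation of those weights is not shown (names hypothetical):

```python
import numpy as np

def ratio_weighted_value(rewards, ratio_weights):
    """Self-normalized estimate of the target policy's average reward:
    logged rewards weighted by the estimated stationary-distribution
    ratio d_target(s, a) / d_data(s, a)."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(ratio_weights, dtype=float)
    return float(np.sum(w * r) / np.sum(w))

rng = np.random.default_rng(5)
rewards = rng.normal(1.0, 0.2, size=1000)
weights = rng.uniform(0.5, 1.5, size=1000)   # stand-in for learned ratios
print(ratio_weighted_value(rewards, weights))
```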
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.