Off-Policy Confidence Interval Estimation with Confounded Markov
Decision Process
- URL: http://arxiv.org/abs/2202.10589v1
- Date: Tue, 22 Feb 2022 00:03:48 GMT
- Title: Off-Policy Confidence Interval Estimation with Confounded Markov
Decision Process
- Authors: Chengchun Shi, Jin Zhu, Ye Shen, Shikai Luo, Hongtu Zhu and Rui Song
- Abstract summary: We show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process.
Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies.
- Score: 14.828039846764549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is concerned with constructing a confidence interval for a target
policy's value offline, based on pre-collected observational data in infinite
horizon settings. Most existing works assume that no unmeasured variables
confound the observed actions. This assumption, however, is likely
to be violated in real applications such as healthcare and technological
industries. In this paper, we show that with some auxiliary variables that
mediate the effect of actions on the system dynamics, the target policy's value
is identifiable in a confounded Markov decision process. Based on this result,
we develop an efficient off-policy value estimator that is robust to potential
model misspecification and provide rigorous uncertainty quantification. Our
method is justified by theoretical results and by simulated and real datasets
obtained from ridesharing companies.
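The paper's efficient estimator and its asymptotic theory are developed in the full text. Purely as a generic illustration of how per-trajectory value estimates can be turned into an off-policy confidence interval, here is a minimal Wald-type sketch; all names are hypothetical and this is not the authors' influence-function-based estimator:

```python
import numpy as np
from scipy import stats

def wald_ci(pseudo_values, alpha=0.05):
    """Wald-type (1 - alpha) confidence interval from i.i.d. per-trajectory
    pseudo-values of the target policy's value (generic sketch only)."""
    pseudo_values = np.asarray(pseudo_values, dtype=float)
    n = len(pseudo_values)
    point = pseudo_values.mean()
    se = pseudo_values.std(ddof=1) / np.sqrt(n)
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return point - z * se, point + z * se

# Example with synthetic pseudo-values.
rng = np.random.default_rng(0)
print(wald_ci(rng.normal(loc=1.2, scale=0.5, size=500)))
```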
Related papers
- Distributional Shift-Aware Off-Policy Interval Estimation: A Unified
Error Quantification Framework [8.572441599469597]
We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes.
The objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies.
We show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings.
arXiv Detail & Related papers (2023-09-23T06:35:44Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
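As background for the conformal entry above, a minimal split-conformal sketch under the vanilla exchangeability assumption; the paper's contribution is precisely to handle the shift between behavior and target policies, which this sketch ignores (names hypothetical):

```python
import numpy as np

def split_conformal_interval(cal_returns, cal_predictions, new_prediction, alpha=0.1):
    """Split-conformal interval around a point prediction of the return.
    Assumes the calibration returns are exchangeable with the new one,
    which plain off-policy data generally violates."""
    scores = np.abs(np.asarray(cal_returns) - np.asarray(cal_predictions))
    n = len(scores)
    # Finite-sample-corrected quantile of the nonconformity scores.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return new_prediction - q, new_prediction + q

rng = np.random.default_rng(1)
returns = rng.normal(1.0, 0.3, size=200)
preds = returns + rng.normal(0.0, 0.1, size=200)   # imperfect model predictions
print(split_conformal_interval(returns, preds, new_prediction=1.05))
```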
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
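As background for the propensity-score entry above, a minimal plain inverse propensity score sketch for logged bandit-style data; UIPS additionally reweights by the uncertainty of the estimated propensities, which is not shown here (names hypothetical):

```python
import numpy as np

def ips_value(rewards, target_probs, behavior_probs, clip=10.0):
    """Plain inverse propensity score estimate of a target policy's value
    from logged (context, action, reward) data.  `target_probs` and
    `behavior_probs` are the probabilities each policy assigns to the
    logged action; clipping is a common variance-control heuristic."""
    w = np.minimum(np.asarray(target_probs) / np.asarray(behavior_probs), clip)
    return float(np.mean(w * np.asarray(rewards)))

rng = np.random.default_rng(2)
behavior = rng.uniform(0.2, 0.8, size=1000)
target = rng.uniform(0.2, 0.8, size=1000)
rewards = rng.binomial(1, 0.4, size=1000)
print(ips_value(rewards, target, behavior))
```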
- An Instrumental Variable Approach to Confounded Off-Policy Evaluation [11.785128674216903]
Off-policy evaluation (OPE) is a method for estimating the return of a target policy.
This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes.
arXiv Detail & Related papers (2022-12-29T22:06:51Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Statistical Bootstrapping for Uncertainty Estimation in Off-Policy
Evaluation [38.31971190670345]
We investigate whether statistical bootstrapping can be used to produce calibrated confidence intervals for the true value of the policy.
We show that it can yield accurate confidence intervals in a variety of conditions, including challenging continuous control environments and small data regimes.
arXiv Detail & Related papers (2020-07-27T14:49:22Z)
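A minimal percentile-bootstrap sketch of the idea in the entry above: resample whole trajectories and recompute the value estimate to form an interval; the paper studies when such intervals are actually calibrated (names hypothetical):

```python
import numpy as np

def bootstrap_ci(trajectory_returns, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean return, resampling whole
    trajectories so within-trajectory dependence is preserved."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(trajectory_returns, dtype=float)
    n = len(returns)
    boot_means = [rng.choice(returns, size=n, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

rng = np.random.default_rng(3)
print(bootstrap_ci(rng.normal(1.0, 0.4, size=300)))
```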
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
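To illustrate why kernelization is needed in the entry above: a deterministic continuous-action target policy places a point mass at pi(s), so the usual density ratio is undefined. One common fix is a kernel-smoothed importance weight. A minimal sketch of that idea only, not the authors' doubly robust construction (bandwidth and names hypothetical):

```python
import numpy as np

def kernel_is_value(states_actions_rewards, target_action_fn, behavior_density_fn, h=0.1):
    """Kernel-smoothed importance sampling value estimate for a deterministic
    target policy pi(s) with continuous actions.  The Dirac mass the target
    policy places at pi(s) is smoothed with a Gaussian kernel of bandwidth h."""
    total, n = 0.0, 0
    for s, a, r in states_actions_rewards:
        k = np.exp(-0.5 * ((a - target_action_fn(s)) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        total += (k / behavior_density_fn(a, s)) * r
        n += 1
    return total / n

# Toy example: behavior actions ~ N(0, 1), target policy always plays 0.0,
# reward r = 1 + 0.5 * a, so the true target value is 1.0.
rng = np.random.default_rng(4)
data = [(None, a, 1.0 + 0.5 * a) for a in rng.normal(0.0, 1.0, size=5000)]
norm_pdf = lambda a, s: np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)
print(kernel_is_value(data, lambda s: 0.0, norm_pdf, h=0.2))
```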
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
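Once a stationary-distribution correction ratio of the kind described in the GenDICE entry has been estimated, the policy value can be read off as a weighted average of logged rewards. A minimal sketch assuming the ratio weights are already available; GenDICE's minimax estimation of those weights is not shown (names hypothetical):

```python
import numpy as np

def ratio_weighted_value(rewards, ratio_weights):
    """Self-normalized estimate of the target policy's average reward:
    logged rewards weighted by the estimated stationary-distribution
    ratio d_target(s, a) / d_data(s, a)."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(ratio_weights, dtype=float)
    return float(np.sum(w * r) / np.sum(w))

rng = np.random.default_rng(5)
rewards = rng.normal(1.0, 0.2, size=1000)
weights = rng.uniform(0.5, 1.5, size=1000)   # stand-in for learned ratios
print(ratio_weighted_value(rewards, weights))
```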
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.