An Instrumental Variable Approach to Confounded Off-Policy Evaluation
- URL: http://arxiv.org/abs/2212.14468v1
- Date: Thu, 29 Dec 2022 22:06:51 GMT
- Title: An Instrumental Variable Approach to Confounded Off-Policy Evaluation
- Authors: Yang Xu, Jin Zhu, Chengchun Shi, Shikai Luo, and Rui Song
- Abstract summary: Off-policy evaluation (OPE) is a method for estimating the return of a target policy.
This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes.
- Score: 11.785128674216903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) is a method for estimating the return of a target
policy using some pre-collected observational data generated by a potentially
different behavior policy. In some cases, there may be unmeasured variables
that can confound the action-reward or action-next-state relationships,
rendering many existing OPE approaches ineffective. This paper develops an
instrumental variable (IV)-based method for consistent OPE in confounded Markov
decision processes (MDPs). As in single-stage decision making, we show that
IV enables us to correctly identify the target policy's value in
infinite-horizon settings as well. Furthermore, we propose an efficient and robust value
estimator and illustrate its effectiveness through extensive simulations and
analysis of real data from a world-leading short-video platform.
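To make the confounding issue concrete, the sketch below simulates a single-stage analogue of the problem: an unmeasured confounder drives both the logged action and the reward, so the naive action-reward contrast is biased, while a Wald/two-stage-least-squares estimate based on an instrument recovers the true effect. The data-generating process and all variable names are illustrative assumptions; the paper's actual estimator targets infinite-horizon confounded MDPs.

```python
# Hedged sketch: single-stage IV correction for a confounded action effect.
# The data-generating process is an illustrative assumption, not the paper's
# infinite-horizon MDP estimator.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.normal(size=n)                          # unmeasured confounder
z = rng.binomial(1, 0.5, size=n).astype(float)  # instrument: shifts the action only
# Behavior policy: the logged action depends on both the instrument and the confounder.
a = (0.8 * z + 0.6 * u + rng.normal(scale=0.5, size=n) > 0.4).astype(float)
true_effect = 2.0
r = true_effect * a + 1.5 * u + rng.normal(size=n)  # reward confounded by u

# Naive contrast E[R|A=1] - E[R|A=0] is biased because u raises both a and r.
naive = r[a == 1].mean() - r[a == 0].mean()

# Wald / two-stage-least-squares estimate with a binary instrument:
#   effect = Cov(R, Z) / Cov(A, Z)
iv = np.cov(r, z)[0, 1] / np.cov(a, z)[0, 1]

print(f"naive: {naive:.2f}  IV: {iv:.2f}  truth: {true_effect:.2f}")
```

On this synthetic example the IV estimate lands near the true effect of 2.0 while the naive contrast is inflated by the confounder; the paper extends this identification argument to the sequential, infinite-horizon setting and pairs it with an efficient and robust value estimator.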
Related papers
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning (see the generic IPS sketch after this list).
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Off-Policy Confidence Interval Estimation with Confounded Markov
Decision Process [14.828039846764549]
We show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process.
Our method is justified by theoretical results as well as by analyses of simulated and real datasets obtained from ridesharing companies.
arXiv Detail & Related papers (2022-02-22T00:03:48Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.