Off-Policy Evaluation with Out-of-Sample Guarantees
- URL: http://arxiv.org/abs/2301.08649v3
- Date: Fri, 30 Jun 2023 07:58:07 GMT
- Title: Off-Policy Evaluation with Out-of-Sample Guarantees
- Authors: Sofia Ek, Dave Zachariah, Fredrik D. Johansson, Petre Stoica
- Abstract summary: We consider the problem of evaluating the performance of a decision policy using past observational data.
We show that it is possible to draw such inferences with finite-sample coverage guarantees about the entire loss distribution.
The evaluation method can be used to certify the performance of a policy using observational data under a specified range of credible model assumptions.
- Score: 21.527138355664174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of evaluating the performance of a decision policy
using past observational data. The outcome of a policy is measured in terms of
a loss (a.k.a. disutility or negative reward) and the main problem is making
valid inferences about its out-of-sample loss when the past data was observed
under a different and possibly unknown policy. Using a sample-splitting method,
we show that it is possible to draw such inferences with finite-sample coverage
guarantees about the entire loss distribution, rather than just its mean.
Importantly, the method takes into account model misspecifications of the past
policy, including unmeasured confounding. The evaluation method can be used to
certify the performance of a policy using observational data under a specified
range of credible model assumptions.
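The paper's exact procedure is not reproduced here, but the core idea of sample-splitting methods of this kind, a weighted quantile of held-out losses with a finite-sample correction, can be sketched in Python. All names are illustrative, and the policy-ratio weights are assumed to be supplied by the user:

```python
import numpy as np

def weighted_loss_quantile(losses, weights, alpha):
    """Conservative upper bound on the alpha-quantile of the target-policy
    loss, computed from held-out losses reweighted by (assumed given)
    policy ratios. A rough illustrative sketch, not the paper's method."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(losses)
    losses, weights = losses[order], weights[order]
    # Finite-sample correction: place one unit of weight at +infinity,
    # so the bound is never anti-conservative on small samples.
    w = np.append(weights, 1.0)
    cdf = np.cumsum(w / w.sum())[:-1]          # weighted CDF over finite losses
    idx = int(np.searchsorted(cdf, alpha))     # first loss whose CDF >= alpha
    return float(losses[idx]) if idx < len(losses) else float("inf")
```

With uniform weights this reduces to an ordinary empirical quantile with the usual +1 conformal correction; when the requested level exceeds the available weighted mass, the bound is vacuous (+inf), which is how finite-sample validity is preserved.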
Related papers
- Externally Valid Policy Evaluation Combining Trial and Observational Data [6.875312133832077]
We seek to use trial data to draw valid inferences about the outcome of a policy on the target population.
We develop a method that yields certifiably valid trial-based policy evaluations under any specified range of model miscalibrations.
arXiv Detail & Related papers (2023-10-23T10:01:50Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
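For context, the standard inverse propensity scoring (IPS) baseline that UIPS refines can be sketched as follows (a minimal illustration, not the UIPS estimator itself):

```python
import numpy as np

def ips_value(rewards, target_probs, behavior_probs):
    """Standard IPS estimate of a target policy's value from logged data:
    each reward is reweighted by the ratio of target to behavior
    action probabilities."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    return float(np.mean(w * r))
```

The estimator is unbiased when the behavior propensities are correct, but its variance blows up when ratios are large, which is the uncertainty that reweighting schemes like UIPS aim to control.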
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
arXiv Detail & Related papers (2022-06-09T10:39:33Z)
- Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions.
We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales [8.807587076209566]
The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the bandit algorithm.
Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.).
Several existing OPE methods ignore this issue and rely on the assumption that the samples are i.i.d.
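A crude illustration of why a martingale structure helps: if the centered estimator terms are bounded, an Azuma-Hoeffding bound yields a (1 - delta) confidence interval even for dependent samples. This is a simplification for intuition only; the cited papers construct sharper intervals from standardized martingales:

```python
import math
import numpy as np

def mds_confidence_interval(terms, bound, delta):
    """(1 - delta) confidence interval for the mean of estimator terms
    whose centered versions form a bounded martingale difference
    sequence, via an Azuma-Hoeffding style deviation bound."""
    t = len(terms)
    estimate = float(np.mean(terms))
    # Azuma-Hoeffding: deviation <= bound * sqrt(2 log(2/delta) / t)
    half_width = bound * math.sqrt(2.0 * math.log(2.0 / delta) / t)
    return estimate - half_width, estimate + half_width
```

The interval shrinks at the usual 1/sqrt(t) rate without any i.i.d. assumption, which is exactly what breaks down for naive methods on adaptively collected data.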
arXiv Detail & Related papers (2020-06-12T07:48:04Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
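The vanilla doubly robust estimator for discrete actions, which these kernelized estimators generalize to the continuous case, can be sketched as follows (all inputs are hypothetical precomputed arrays):

```python
import numpy as np

def doubly_robust_value(rewards, q_at_logged, q_at_target, weights):
    """Vanilla doubly robust OPE: the direct model estimate at the target
    policy's actions, plus an importance-weighted residual correction.
    Unbiased if either the weights or the outcome model is correct."""
    r = np.asarray(rewards, dtype=float)
    ql = np.asarray(q_at_logged, dtype=float)   # model value at logged actions
    qt = np.asarray(q_at_target, dtype=float)   # model value at target actions
    w = np.asarray(weights, dtype=float)        # density/propensity ratios
    return float(np.mean(qt + w * (r - ql)))
```

For a deterministic policy over continuous actions the density ratio `w` does not exist, which is the failure mode the kernelization in the cited paper is designed to work around.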
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Learning Robust Decision Policies from Observational Data [21.05564340986074]
It is of interest to learn robust policies that reduce the risk of outcomes with high costs.
We develop a method for learning policies that reduce tails of the cost distribution at a specified level.
arXiv Detail & Related papers (2020-06-03T16:02:57Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.