When is Off-Policy Evaluation Useful? A Data-Centric Perspective
- URL: http://arxiv.org/abs/2311.14110v1
- Date: Thu, 23 Nov 2023 17:13:37 GMT
- Title: When is Off-Policy Evaluation Useful? A Data-Centric Perspective
- Authors: Hao Sun, Alex J. Chan, Nabeel Seedat, Alihan Hüyük, Mihaela van der Schaar
- Abstract summary: Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
- Score: 60.76880827781716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the value of a hypothetical target policy with only a logged
dataset is important but challenging. On the one hand, it brings opportunities
for safe policy improvement in high-stakes scenarios such as clinical
guidelines. On the other hand, such opportunities raise a need for precise
off-policy evaluation (OPE). While previous work on OPE focused on improving
the value-estimation algorithms, in this work we emphasize the importance of
the offline dataset and hence put forward a data-centric framework for OPE
problems. We propose DataCOPE, a data-centric framework that answers the
questions of whether and to what extent we can evaluate a target policy given
a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms
without access to the environment, which is especially useful before
real-world deployment, where evaluating OPE is impossible; (2) identifies the
sub-groups in the dataset where OPE can be inaccurate; (3) permits evaluation
of datasets or data-collection strategies for OPE problems. Our empirical
analysis of DataCOPE in logged contextual bandit settings using healthcare
datasets confirms its ability to evaluate both machine-learning and human
expert policies such as clinical guidelines.
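
The abstract does not spell out DataCOPE's internals, so the following is only a minimal sketch of the data-centric idea it describes, for a logged contextual-bandit dataset with known propensities: before trusting any OPE estimate, check how well the logged data covers the target policy's actions, overall and per sub-group. All function names and the toy data are illustrative assumptions, not from the paper.

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size, (sum w)^2 / sum w^2. An ESS far below
    len(w) signals poor overlap between the logging and target policies,
    and hence an OPE estimate that should not be trusted."""
    return w.sum() ** 2 / (w ** 2).sum()

def coverage_report(pi_e, pi_b, groups):
    """Per-sub-group overlap diagnostic for a logged contextual bandit.
    pi_e / pi_b are the target / logging probabilities of each logged action."""
    w = pi_e / np.clip(pi_b, 1e-8, None)   # importance weights
    for g in np.unique(groups):
        m = groups == g
        ess = effective_sample_size(w[m])
        print(f"group {g}: n={m.sum()}, ESS={ess:.1f} "
              f"({100 * ess / m.sum():.0f}% of n)")

# Toy logged data: the target policy matches the logging policy except in
# group 2, where it concentrates on actions the logging policy often avoided,
# so OPE for that sub-group should be flagged as unreliable.
rng = np.random.default_rng(0)
n = 3000
groups = rng.integers(0, 3, size=n)          # e.g., patient sub-populations
pi_b = rng.uniform(0.05, 0.95, size=n)       # logging propensities
pi_e = np.where(groups == 2, 0.95, pi_b)     # target propensities
coverage_report(pi_e, pi_b, groups)
```

The sub-group with a low effective sample size is exactly the kind of region DataCOPE's second capability is meant to flag.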
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance.
We propose a new algorithm that uses a statistical procedure to adaptively blend a set of OPE estimators on a given dataset, without relying on an explicit selection step.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
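As a rough sketch of the blending idea (the summary above does not specify OPERA's statistical procedure, so the inverse-variance rule and estimator names below are assumptions):

```python
import numpy as np

def blend_estimates(estimates, variances):
    """Combine several OPE point estimates by inverse-variance weighting:
    lower-variance estimators receive proportionally more weight."""
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()
    return float(np.dot(w, estimates)), w

# Hypothetical outputs of three OPE estimators (e.g., IPS, direct method,
# doubly robust), with variances bootstrapped from the same logged dataset.
estimates = [0.42, 0.55, 0.47]
variances = [0.030, 0.004, 0.010]
value, weights = blend_estimates(estimates, variances)
print(f"blended value: {value:.3f}, weights: {np.round(weights, 2)}")
```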
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Policy-Adaptive Estimator Selection for Off-Policy Evaluation [12.1655494876088]
Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data.
This paper is the first to study the challenging problem of estimator selection for OPE.
In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies.
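The paper's actual subsampling and pseudo-policy construction is more involved than this, but a minimal sketch of such an adaptive selection loop might look as follows; the on-policy proxy and thresholds are assumptions for illustration:

```python
import numpy as np

def ips(r, pi_e, pi_b):
    """Plain inverse-propensity-scoring estimate of the target policy's value."""
    return float(np.mean((pi_e / pi_b) * r))

def clipped_ips(r, pi_e, pi_b, clip=10.0):
    """IPS with clipped weights: biased but lower variance."""
    return float(np.mean(np.minimum(pi_e / pi_b, clip) * r))

def select_estimator(r, pi_e, pi_b, candidates, n_trials=100, seed=0):
    """Pick the candidate with the lowest squared error across pseudo OPE
    tasks built by subsampling the logged data. Records where the target
    policy assigns high probability to the logged action serve as a crude
    proxy for on-policy data, and their mean reward as pseudo ground truth."""
    rng = np.random.default_rng(seed)
    n = len(r)
    errors = {name: [] for name in candidates}
    for _ in range(n_trials):
        idx = rng.choice(n, size=n // 2, replace=False)
        proxy = idx[pi_e[idx] > np.median(pi_e[idx])]
        truth = r[proxy].mean()
        for name, est in candidates.items():
            errors[name].append((est(r[idx], pi_e[idx], pi_b[idx]) - truth) ** 2)
    return min(errors, key=lambda k: float(np.mean(errors[k])))

# Toy usage on synthetic logged bandit data
rng = np.random.default_rng(1)
n = 2000
pi_b = rng.uniform(0.1, 0.9, n)
pi_e = rng.uniform(0.1, 0.9, n)
r = rng.binomial(1, 0.4 + 0.2 * pi_e, n).astype(float)   # reward tracks pi_e
print("selected:", select_estimator(r, pi_e, pi_b,
                                    {"ips": ips, "clipped": clipped_ips}))
```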
arXiv Detail & Related papers (2022-11-25T05:31:42Z)
- Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z)
- Robust On-Policy Data Collection for Data-Efficient Policy Evaluation [7.745028845389033]
In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest.
We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset.
We show that simply running the evaluation policy -- on-policy data collection -- is sub-optimal for this setting.
arXiv Detail & Related papers (2021-11-29T14:30:26Z)
- Evaluating the Robustness of Off-Policy Evaluation [10.760026478889664]
Off-policy Evaluation (OPE) evaluates the performance of hypothetical policies leveraging only offline log data.
It is particularly useful in applications where online interaction is high-stakes and expensive.
We develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' robustness.
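IEOE's concrete protocol is not given in the summary; a minimal sketch of the underlying idea, assuming a clipped-IPS estimator whose robustness we probe by sweeping its clipping hyperparameter over bootstrap resamples:

```python
import numpy as np

def clipped_ips(r, pi_e, pi_b, clip):
    """IPS value estimate with importance weights clipped at `clip`."""
    return float(np.mean(np.minimum(pi_e / pi_b, clip) * r))

def robustness_profile(r, pi_e, pi_b, clips, n_boot=200, seed=0):
    """Spread of estimates across hyperparameter choices and bootstrap
    resamples of the logged data; a small interquartile range suggests
    the estimator is robust to these nuisance choices."""
    rng = np.random.default_rng(seed)
    n, out = len(r), []
    for clip in clips:
        for _ in range(n_boot):
            i = rng.integers(0, n, size=n)    # bootstrap resample of the log
            out.append(clipped_ips(r[i], pi_e[i], pi_b[i], clip))
    q25, q50, q75 = np.percentile(out, [25, 50, 75])
    return {"median": float(q50), "iqr": float(q75 - q25)}

# Toy usage on synthetic logged bandit data
rng = np.random.default_rng(2)
n = 1000
pi_b = rng.uniform(0.1, 0.9, n)
pi_e = rng.uniform(0.1, 0.9, n)
r = rng.binomial(1, 0.5, n).astype(float)
print(robustness_profile(r, pi_e, pi_b, clips=[1, 5, 10, 100]))
```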
arXiv Detail & Related papers (2021-08-31T09:33:13Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
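The summary above gives no details of the framework, so the snippet below is only a generic stand-in for the idea of robust versus optimistic estimates from a single logged dataset: bootstrap an IPS estimate and read off the lower (conservative) and upper (optimistic) percentiles.

```python
import numpy as np

def ips(r, pi_e, pi_b):
    """Plain inverse-propensity-scoring estimate of the target policy's value."""
    return float(np.mean((pi_e / pi_b) * r))

def bootstrap_bounds(r, pi_e, pi_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for an IPS value estimate. The lower
    end is a conservative (robust) estimate of the target policy's value;
    the upper end is an optimistic one."""
    rng = np.random.default_rng(seed)
    n = len(r)
    boots = []
    for _ in range(n_boot):
        i = rng.integers(0, n, size=n)      # resample the logged dataset
        boots.append(ips(r[i], pi_e[i], pi_b[i]))
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Toy usage on synthetic logged bandit data
rng = np.random.default_rng(3)
n = 2000
pi_b = rng.uniform(0.1, 0.9, n)
pi_e = rng.uniform(0.1, 0.9, n)
r = rng.binomial(1, 0.5, n).astype(float)
print("robust / optimistic value:", bootstrap_bounds(r, pi_e, pi_b))
```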
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, enabling human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.