Policy-Adaptive Estimator Selection for Off-Policy Evaluation
- URL: http://arxiv.org/abs/2211.13904v1
- Date: Fri, 25 Nov 2022 05:31:42 GMT
- Title: Policy-Adaptive Estimator Selection for Off-Policy Evaluation
- Authors: Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno
- Abstract summary: Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data.
This paper studies this challenging problem of estimator selection for OPE for the first time.
In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies.
- Score: 12.1655494876088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) aims to accurately evaluate the performance of
counterfactual policies using only offline logged data. Although many
estimators have been developed, there is no single estimator that dominates the
others, because the estimators' accuracy can vary greatly depending on a given
OPE task such as the evaluation policy, number of actions, and noise level.
Thus, the data-driven estimator selection problem is becoming increasingly
important and can have a significant impact on the accuracy of OPE. However,
identifying the most accurate estimator using only the logged data is quite
challenging because the ground-truth estimation accuracy of estimators is
generally unavailable. This paper studies this challenging problem of estimator
selection for OPE for the first time. In particular, we enable an estimator
selection that is adaptive to a given OPE task, by appropriately subsampling
available logged data and constructing pseudo policies useful for the
underlying estimator selection task. Comprehensive experiments on both
synthetic and real-world company data demonstrate that the proposed procedure
substantially improves the estimator selection compared to a non-adaptive
heuristic.
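The abstract describes making estimator selection adaptive by subsampling the logged data and constructing pseudo policies whose value can be approximated from the data itself. The following Python sketch illustrates that general idea under heavy simplifying assumptions: the function names, the perturbation-based pseudo evaluation policy, and the held-out IPS surrogate ground truth are illustrative choices made here, not the authors' actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def ips_estimate(logs, eval_probs):
    """Inverse propensity scoring: reweight logged rewards by pi_e(a|x) / pi_b(a|x)."""
    w = eval_probs[np.arange(len(logs["action"])), logs["action"]] / logs["pscore"]
    return float(np.mean(w * logs["reward"]))

def dm_estimate(logs, eval_probs, q_hat):
    """Direct method: average a (here, deliberately crude) reward model under pi_e."""
    return float(np.mean((eval_probs * q_hat).sum(axis=1)))

def select_estimator(logs, behavior_probs, n_rounds=20, frac=0.5):
    """Pick the estimator with the smallest squared error on pseudo OPE tasks.

    Each round: subsample the logs, perturb the behavior policy into a pseudo
    evaluation policy, and score every candidate estimator against a surrogate
    ground truth computed on the held-out part of the data.
    """
    n, n_actions = behavior_probs.shape
    errors = {"ips": [], "dm": []}
    for _ in range(n_rounds):
        idx = rng.choice(n, size=int(n * frac), replace=False)
        rest = np.setdiff1d(np.arange(n), idx)
        sub = {k: v[idx] for k, v in logs.items()}

        # Pseudo evaluation policy: a noisy, re-normalized copy of the behavior policy.
        logits = np.log(behavior_probs + 1e-8) + rng.normal(scale=0.5, size=(n, n_actions))
        eval_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

        # Surrogate ground truth: IPS on the held-out rows (a stand-in for the
        # unavailable true value of the pseudo evaluation policy).
        truth = ips_estimate({k: v[rest] for k, v in logs.items()}, eval_probs[rest])

        q_hat = np.full((len(idx), n_actions), sub["reward"].mean())  # constant reward model
        errors["ips"].append((ips_estimate(sub, eval_probs[idx]) - truth) ** 2)
        errors["dm"].append((dm_estimate(sub, eval_probs[idx], q_hat) - truth) ** 2)
    return min(errors, key=lambda k: float(np.mean(errors[k])))
```

In this toy version only IPS and a constant-prediction direct method compete; a faithful implementation would follow the paper's pseudo-policy construction and include the usual candidate estimators (DM with a learned model, DR, SNIPS, switch estimators, and so on).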
Related papers
- Automated Off-Policy Estimator Selection via Supervised Learning [7.476028372444458]
The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies using data collected by a different (logging) policy.
To solve the OPE problem, we rely on estimators, which aim to estimate as accurately as possible the performance the counterfactual policies would have achieved had they been deployed in place of the logging policy.
We propose an automated data-driven OPE estimator selection method based on supervised learning.
arXiv Detail & Related papers (2024-06-26T02:34:48Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance.
We propose a new algorithm that uses a statistical procedure to adaptively blend a set of OPE estimators given a dataset, without relying on an explicit selection step (a minimal blending sketch appears after the related-papers list).
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
Evaluating the value of a hypothetical target policy using only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
arXiv Detail & Related papers (2023-11-23T17:13:37Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Data-Driven Off-Policy Estimator Selection: An Application in User Marketing on An Online Content Delivery Service [11.986224119327387]
Off-policy evaluation is essential in domains such as healthcare, marketing or recommender systems.
Many OPE methods with theoretical backing have been proposed.
However, it is often unclear to practitioners which estimator to use for their specific applications and purposes.
arXiv Detail & Related papers (2021-09-17T15:53:53Z)
- Evaluating the Robustness of Off-Policy Evaluation [10.760026478889664]
Off-policy Evaluation (OPE) evaluates the performance of hypothetical policies using only offline log data.
It is particularly useful in applications where online interaction involves high stakes and is expensive.
We develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' robustness.
arXiv Detail & Related papers (2021-08-31T09:33:13Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the pseudoinverse (PI) and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We derive the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient estimator.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
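For the OPERA entry above, which blends several OPE estimators rather than selecting one, a minimal illustration of the blending idea is an inverse-variance convex combination of the estimators' outputs, with variances approximated by a bootstrap. This is only a sketch under that assumption; OPERA's actual weighting also accounts for bias and uses its own statistical procedure.

```python
import numpy as np

def blend_estimates(point_estimates: np.ndarray, bootstrap_estimates: np.ndarray) -> float:
    """Convex combination of several OPE estimators' outputs.

    point_estimates: shape (k,), one value per estimator on the full logged data.
    bootstrap_estimates: shape (b, k), each estimator recomputed on b bootstrap
    resamples of the logs; their spread is used as a rough variance proxy.
    """
    variances = bootstrap_estimates.var(axis=0) + 1e-12
    weights = (1.0 / variances) / np.sum(1.0 / variances)  # inverse-variance weights
    return float(weights @ point_estimates)

# Example: three candidate estimators (say IPS, DM, DR) and 200 bootstrap replicates.
rng = np.random.default_rng(1)
blended = blend_estimates(
    np.array([0.41, 0.38, 0.40]),
    0.40 + rng.normal(scale=[0.05, 0.01, 0.02], size=(200, 3)),
)
```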
This list is automatically generated from the titles and abstracts of the papers on this site.