Off-Policy Evaluation and Learning for External Validity under a
Covariate Shift
- URL: http://arxiv.org/abs/2002.11642v3
- Date: Fri, 16 Oct 2020 01:40:41 GMT
- Title: Off-Policy Evaluation and Learning for External Validity under a
Covariate Shift
- Authors: Masahiro Kato, Masatoshi Uehara, Shota Yasui
- Abstract summary: We consider evaluating and training a new policy for the evaluation data by using the historical data obtained from a different policy.
The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data.
- Score: 32.37842308026544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider evaluating and training a new policy for the evaluation data by
using the historical data obtained from a different policy. The goal of
off-policy evaluation (OPE) is to estimate the expected reward of a new policy
over the evaluation data, and that of off-policy learning (OPL) is to find a
new policy that maximizes the expected reward over the evaluation data.
Although standard OPE and OPL assume the same covariate distribution between
the historical and evaluation data, a covariate shift often exists, i.e., the
covariate distribution of the historical data differs from that of the
evaluation data. In this paper, we derive the efficiency bound
of OPE under a covariate shift. Then, we propose doubly robust and efficient
estimators for OPE and OPL under a covariate shift by using a nonparametric
estimator of the density ratio between the historical and evaluation data
distributions. We also discuss other possible estimators and compare their
theoretical properties. Finally, we confirm the effectiveness of the proposed
estimators through experiments.
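As a rough illustration of the kind of estimator the abstract describes, the sketch below combines (i) a density ratio between the evaluation and historical covariate distributions, estimated here with a probabilistic classifier (one common nonparametric recipe, not necessarily the paper's exact estimator), and (ii) a doubly robust value estimate that reweights the importance-weighted correction term on the historical data by that ratio and adds a direct-method term computed on the evaluation covariates. It assumes a contextual-bandit (single-step) setting, discrete actions, and a known behavior policy; all function and variable names are illustrative, not the authors' implementation.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge


def estimate_density_ratio(x_hist, x_eval):
    """Estimate r(x) = p_eval(x) / p_hist(x) via probabilistic classification."""
    x = np.vstack([x_hist, x_eval])
    z = np.concatenate([np.zeros(len(x_hist)), np.ones(len(x_eval))])
    clf = LogisticRegression(max_iter=1000).fit(x, z)
    p = clf.predict_proba(x_hist)[:, 1]        # P(source = eval | x) on historical points
    odds = p / (1.0 - p)
    return odds * len(x_hist) / len(x_eval)    # correct for the sample-size imbalance


def dr_value_under_covariate_shift(x_hist, a_hist, y_hist, pi_b, x_eval, pi_e, n_actions):
    """Doubly robust estimate of pi_e's expected reward over the evaluation covariates.

    pi_b and pi_e are callables returning (n, n_actions) arrays of action probabilities;
    a_hist is an integer array of logged actions, y_hist the logged rewards.
    """
    # Outcome model q_hat(x, a): one ridge regression per action (illustrative choice).
    q_models = [Ridge().fit(x_hist[a_hist == a], y_hist[a_hist == a]) for a in range(n_actions)]

    # Direct-method term on evaluation covariates: E_eval[ sum_a pi_e(a|x) q_hat(x, a) ].
    q_eval = np.column_stack([m.predict(x_eval) for m in q_models])
    dm_term = np.mean(np.sum(pi_e(x_eval) * q_eval, axis=1))

    # Importance-weighted correction on historical data, reweighted by the density ratio.
    idx = np.arange(len(x_hist))
    w = pi_e(x_hist)[idx, a_hist] / pi_b(x_hist)[idx, a_hist]
    r = estimate_density_ratio(x_hist, x_eval)
    q_hist = np.column_stack([m.predict(x_hist) for m in q_models])[idx, a_hist]
    correction = np.mean(r * w * (y_hist - q_hist))

    return dm_term + correction
```
The doubly robust structure is the point of the sketch: the estimate stays consistent if either the outcome model or the density-ratio-weighted propensity weights are accurate, which is the sense in which such estimators are called doubly robust.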
Related papers
- Combining Experimental and Historical Data for Policy Evaluation [17.89146022336492]
We propose novel data integration methods that linearly integrate base policy value estimators constructed based on the experimental and historical data.
We establish their robustness and efficiency properties across a broad spectrum of reward shift scenarios.
Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators (a minimal sketch of such a linear combination appears after this list).
arXiv Detail & Related papers (2024-06-01T06:26:28Z)
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators for a given dataset, rather than relying on explicit selection via a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments [31.492146288630515]
We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias.
We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
arXiv Detail & Related papers (2023-02-23T01:17:21Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Offline Policy Comparison under Limited Historical Agent-Environment Interactions [0.0]
We address the challenge of policy evaluation in real-world applications of reinforcement learning systems.
We propose that one should perform policy comparison, i.e., rank the policies of interest by their value based on the available historical data.
arXiv Detail & Related papers (2021-06-07T19:51:00Z)
- Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator by adaptively weighting observations to control its variance.
We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
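The first related paper above describes linearly integrating base policy-value estimators built from experimental and historical data. As a purely illustrative sketch of such a linear combination (not that paper's specific integration method), a convex combination of base estimates with inverse-variance weights looks like this; all names and numbers are hypothetical:
```python
import numpy as np


def combine_estimators(estimates, variances):
    """Convex combination of base policy-value estimators with inverse-variance weights."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    value = float(np.dot(weights, estimates))
    combined_variance = 1.0 / np.sum(1.0 / variances)  # valid if the estimators are independent
    return value, combined_variance


# Example: blend an experimental-data estimate with a historical-data estimate.
value, variance = combine_estimators(estimates=[1.32, 1.18], variances=[0.04, 0.01])
```
When the base estimators are unbiased and uncorrelated, these weights minimize the variance of the combined estimate; handling bias under the reward shift scenarios studied in that paper requires machinery beyond this sketch.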
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.