Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old
Data in Nonstationary Environments
- URL: http://arxiv.org/abs/2302.11725v1
- Date: Thu, 23 Feb 2023 01:17:21 GMT
- Title: Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old
Data in Nonstationary Environments
- Authors: Vincent Liu, Yash Chandak, Philip Thomas, Martha White
- Abstract summary: We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias.
We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
- Score: 31.492146288630515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we consider the off-policy policy evaluation problem for
contextual bandits and finite horizon reinforcement learning in the
nonstationary setting. Reusing old data is critical for policy evaluation, but
existing estimators that reuse old data introduce large bias such that we cannot
obtain a valid confidence interval. Inspired by a related field called
survey sampling, we introduce a variant of the doubly robust (DR) estimator,
called the regression-assisted DR estimator, that can incorporate the past data
without introducing a large bias. The estimator unifies several existing
off-policy policy evaluation methods and improves on them with the use of
auxiliary information and a regression approach. We prove that the new
estimator is asymptotically unbiased, and provide a consistent variance
estimator to construct a large-sample confidence interval. Finally, we
empirically show that the new estimator improves estimation for the current and
future policy values, and provides a tight and valid interval estimation in
several nonstationary recommendation environments.
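The abstract does not spell out the regression-assisted DR estimator in closed form, so the sketch below shows only the standard doubly robust estimator for contextual bandits, together with the kind of large-sample (normal-approximation) confidence interval the abstract refers to. The function name dr_estimate, its arguments, and the plug-in reward model q_hat are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dr_estimate(rewards, actions, behavior_probs, target_probs, q_hat):
    """Standard doubly robust (DR) off-policy value estimate for contextual
    bandits, with a large-sample 95% confidence interval.

    rewards        : (n,) observed rewards r_i
    actions        : (n,) logged action indices a_i
    behavior_probs : (n,) behavior-policy probabilities mu(a_i | x_i)
    target_probs   : (n, K) target-policy probabilities pi(a | x_i) for all K actions
    q_hat          : (n, K) regression estimates of the expected reward q(x_i, a)
    """
    n = len(rewards)
    idx = np.arange(n)
    # Direct-method term: expected q_hat under the target policy for each context
    direct = (target_probs * q_hat).sum(axis=1)
    # Importance-weighted correction on the logged action
    rho = target_probs[idx, actions] / behavior_probs
    per_sample = direct + rho * (rewards - q_hat[idx, actions])
    v_hat = per_sample.mean()
    # Sample variance of the per-sample terms gives a normal-approximation CI
    se = per_sample.std(ddof=1) / np.sqrt(n)
    return v_hat, (v_hat - 1.96 * se, v_hat + 1.96 * se)
```

Per the abstract, the regression-assisted variant would additionally use auxiliary information and a regression approach so that past data can be incorporated without introducing a large bias; that step is omitted here because the abstract does not specify its form.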
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation [13.325600043256552]
We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
arXiv Detail & Related papers (2023-10-26T04:41:19Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning; a minimal sketch of the plain IPS baseline it builds on appears after this list.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator by adaptively weighting observations to control its variance.
We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
- Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking.
They can both improve outcomes for study participants and increase the chance of identifying good or even the best policies.
To support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or value of new policies.
arXiv Detail & Related papers (2021-06-01T12:01:51Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Accountable Off-Policy Evaluation With Kernel Bellman Statistics [29.14119984573459]
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments.
Due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimates.
We propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE to an optimization problem.
arXiv Detail & Related papers (2020-08-15T07:24:38Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
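Several of the related estimators above (for example UIPS and the adaptive-weighting DR estimator) start from plain inverse propensity scoring. The sketch below shows only that textbook IPS baseline; the function name ips_estimate and its inputs are illustrative and not taken from any of the cited papers.

```python
import numpy as np

def ips_estimate(rewards, behavior_probs, target_probs_on_logged_actions):
    """Plain inverse propensity scoring (IPS) off-policy value estimate.

    rewards                        : (n,) observed rewards r_i
    behavior_probs                 : (n,) logging-policy probabilities mu(a_i | x_i)
    target_probs_on_logged_actions : (n,) evaluated-policy probabilities pi(a_i | x_i)
    """
    # Reweight each logged reward by how much more (or less) likely the
    # evaluated policy is to take the logged action than the logging policy.
    weights = target_probs_on_logged_actions / behavior_probs
    return float(np.mean(weights * rewards))
```

Several of the listed papers modify these weights (e.g. uncertainty-aware or adaptive weighting) or combine them with a reward model, as in the DR estimator sketched after the abstract.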
This list is automatically generated from the titles and abstracts of the papers on this site.