Counterfactual-Augmented Importance Sampling for Semi-Offline Policy
Evaluation
- URL: http://arxiv.org/abs/2310.17146v1
- Date: Thu, 26 Oct 2023 04:41:19 GMT
- Title: Counterfactual-Augmented Importance Sampling for Semi-Offline Policy
Evaluation
- Authors: Shengpu Tang, Jenna Wiens
- Abstract summary: We propose a semi-offline evaluation framework, where human users provide annotations of unobserved counterfactual trajectories.
Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of reinforcement learning in high-stakes domains.
- Score: 13.325600043256552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In applying reinforcement learning (RL) to high-stakes domains, quantitative
and qualitative evaluation using observational data can help practitioners
understand the generalization performance of new policies. However, this type
of off-policy evaluation (OPE) is inherently limited since offline data may not
reflect the distribution shifts resulting from the application of new policies.
On the other hand, online evaluation by collecting rollouts according to the
new policy is often infeasible, as deploying new policies in these domains can
be unsafe. In this work, we propose a semi-offline evaluation framework as an
intermediate step between offline and online evaluation, where human users
provide annotations of unobserved counterfactual trajectories. While tempting
to simply augment existing data with such annotations, we show that this naive
approach can lead to biased results. Instead, we design a new family of OPE
estimators based on importance sampling (IS) and a novel weighting scheme that
incorporate counterfactual annotations without introducing additional bias. We
analyze the theoretical properties of our approach, showing its potential to
reduce both bias and variance compared to standard IS estimators. Our analyses
reveal important practical considerations for handling biased, noisy, or
missing annotations. In a series of proof-of-concept experiments involving
bandits and a healthcare-inspired simulator, we demonstrate that our approach
outperforms purely offline IS estimators and is robust to imperfect
annotations. Our framework, combined with principled human-centered design of
annotation solicitation, can enable the application of RL in high-stakes
domains.
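The estimators proposed here build on standard importance sampling for off-policy evaluation. As a point of reference, the following is a minimal sketch, in Python with NumPy, of an ordinary (trajectory-wise) IS estimator; it is only the offline baseline the paper extends, since the abstract does not specify the counterfactual weighting scheme, and the function and argument names are illustrative assumptions rather than the authors' code.

    import numpy as np

    def ordinary_is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
        """Ordinary (trajectory-wise) importance sampling estimate of the value of an
        evaluation policy pi_e, using episodes collected under a behavior policy pi_b.

        trajectories: list of episodes, each a list of (state, action, reward) tuples
        pi_e, pi_b:   callables mapping (state, action) to an action probability
        gamma:        discount factor
        """
        estimates = []
        for episode in trajectories:
            rho = 1.0   # cumulative importance weight pi_e / pi_b over the episode
            ret = 0.0   # discounted return of the episode
            for t, (s, a, r) in enumerate(episode):
                rho *= pi_e(s, a) / pi_b(s, a)
                ret += (gamma ** t) * r
            estimates.append(rho * ret)
        return float(np.mean(estimates))

A naive use of annotations would simply append the annotated counterfactual trajectories to the logged data passed to such an estimator; as the abstract notes, that can bias the result, which motivates the paper's dedicated weighting scheme for incorporating annotations without additional bias.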
Related papers
- Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data [34.55109747972333]
This paper aims to adapt the source model to the target environment using user feedback readily available in real-world applications.
We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF).
We propose a scalable adaptation approach, Retrieval Latent Defending.
arXiv Detail & Related papers (2024-07-22T05:15:41Z) - Towards Evaluating Transfer-based Attacks Systematically, Practically,
and Fairly [79.07074710460012]
The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention.
An increasing number of transfer-based methods have been developed to fool black-box DNN models.
We establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods.
arXiv Detail & Related papers (2023-11-02T15:35:58Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old
Data in Nonstationary Environments [31.492146288630515]
We introduce a variant of the doubly robust (DR) estimator, called the regression-assisted DR estimator, that can incorporate the past data without introducing a large bias (a generic DR sketch for contextual bandits appears after this list).
We empirically show that the new estimator improves estimation for the current and future policy values, and provides a tight and valid interval estimation in several nonstationary recommendation environments.
arXiv Detail & Related papers (2023-02-23T01:17:21Z) - Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
arXiv Detail & Related papers (2022-02-13T14:00:09Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via
Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective offline estimation of stationary values can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy under unobserved confounding.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
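For reference on the doubly robust (DR) family mentioned in the nonstationary-environments entry above, below is a minimal sketch of the standard DR estimator for contextual bandits: a direct-method prediction from a fitted reward model plus an inverse-propensity correction of its residual on the logged action. The regression-assisted variant proposed in that paper adds further components not shown here, and names such as q_hat are illustrative assumptions.

    import numpy as np

    def dr_estimate(contexts, actions, rewards, propensities, pi_e, q_hat, num_actions):
        """Standard doubly robust (DR) off-policy value estimate for contextual bandits.

        contexts, actions, rewards, propensities: logged data from the behavior policy
        pi_e(x, a):  probability that the evaluation policy takes action a in context x
        q_hat(x, a): fitted reward model (the direct-method component)
        """
        estimates = []
        for x, a, r, p in zip(contexts, actions, rewards, propensities):
            # Direct-method term: expected reward of pi_e under the fitted model
            dm = sum(pi_e(x, b) * q_hat(x, b) for b in range(num_actions))
            # Inverse-propensity correction of the model's residual on the logged action
            correction = (pi_e(x, a) / p) * (r - q_hat(x, a))
            estimates.append(dm + correction)
        return float(np.mean(estimates))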
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences of its use.