Variance-Optimal Augmentation Logging for Counterfactual Evaluation in
Contextual Bandits
- URL: http://arxiv.org/abs/2202.01721v1
- Date: Thu, 3 Feb 2022 17:37:11 GMT
- Title: Variance-Optimal Augmentation Logging for Counterfactual Evaluation in
Contextual Bandits
- Authors: Aaron David Tucker and Thorsten Joachims
- Abstract summary: Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems.
The counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.
This paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
- Score: 25.153656462604268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Methods for offline A/B testing and counterfactual learning are seeing rapid
adoption in search and recommender systems, since they allow efficient reuse of
existing log data. However, there are fundamental limits to using existing log
data alone, since the counterfactual estimators that are commonly used in these
methods can have large bias and large variance when the logging policy is very
different from the target policy being evaluated. To overcome this limitation,
we explore the question of how to design data-gathering policies that most
effectively augment an existing dataset of bandit feedback with additional
observations for both learning and evaluation. To this effect, this paper
introduces Minimum Variance Augmentation Logging (MVAL), a method for
constructing logging policies that minimize the variance of the downstream
evaluation or learning problem. We explore multiple approaches to computing
MVAL policies efficiently, and find that they can be substantially more
effective in decreasing the variance of an estimator than naïve approaches.
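The abstract states the objective (choose an augmentation logging policy that minimizes the variance of the downstream estimator) but not its closed-form solution. As a rough, hypothetical illustration of that objective only, the sketch below runs a naive random search, for a single context, over candidate augmentation policies and scores each one with a variance proxy of a balanced IPS estimate that pools the existing log with the new samples. The function names, the variance proxy, and the assumed per-action second moments `r2` are illustrative assumptions, not the MVAL algorithm from the paper.

```python
import numpy as np

def balanced_ips_variance_proxy(pi_t, pi_0, pi_1, n0, n1, r2):
    """Variance proxy (up to an additive constant) of a balanced IPS estimate
    when n0 logged samples come from pi_0 and n1 new samples from pi_1.

    Treats all n0 + n1 samples as if drawn from the effective mixture policy,
    which is a simplification made for this sketch, not the paper's analysis.
    pi_t, pi_0, pi_1: action-probability arrays for one context.
    r2: assumed per-action second moments E[r^2 | a] (an assumption here).
    """
    mix = (n0 * pi_0 + n1 * pi_1) / (n0 + n1)   # effective behavior policy
    return np.sum(pi_t**2 * r2 / mix) / (n0 + n1)

def search_augmentation_policy(pi_t, pi_0, n0, n1, r2, n_candidates=20000, seed=0):
    """Random search over the simplex for a low-variance augmentation policy.
    A brute-force stand-in for the efficient MVAL computations in the paper."""
    rng = np.random.default_rng(seed)
    best_pi, best_var = None, np.inf
    for pi_1 in rng.dirichlet(np.ones_like(pi_t), size=n_candidates):
        v = balanced_ips_variance_proxy(pi_t, pi_0, pi_1, n0, n1, r2)
        if v < best_var:
            best_pi, best_var = pi_1, v
    return best_pi, best_var

# Example: the target puts mass on an action the old logger rarely chose,
# so the augmentation policy should concentrate there.
pi_t = np.array([0.1, 0.1, 0.8])
pi_0 = np.array([0.6, 0.3, 0.1])
r2 = np.ones(3)                      # assumed equal second moments
pi_1, var = search_augmentation_policy(pi_t, pi_0, n0=1000, n1=500, r2=r2)
print("augmentation policy:", np.round(pi_1, 3), "variance proxy:", round(var, 4))
```

In this toy setting the search concentrates the augmentation policy on the action the target policy favors but the original logger rarely chose, matching the intuition in the abstract; the paper's contribution is computing such policies efficiently rather than by brute-force search.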
Related papers
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning [16.999444076456268]
This paper proposes an advantage estimation approach based on data augmentation for policy optimization.
Our method uses data augmentation to compute a bootstrap advantage estimate.
We observe that our method reduces the policy and value losses more than Generalized Advantage Estimation.
arXiv Detail & Related papers (2022-10-13T19:30:43Z)
- Semi-supervised Batch Learning From Logged Data [24.826544828460158]
We build on the counterfactual risk minimization framework, which also assumes access to propensity scores.
We propose learning methods for problems where feedback is missing for some samples, so the logged data contain both samples with feedback and samples without feedback.
arXiv Detail & Related papers (2022-09-15T08:58:28Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm that avoids the overfitting such estimators are prone to, via a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We derive the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient estimator (a minimal per-logger combination sketch appears after this list).
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking [74.46448041224247]
We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for logging data.
LogOpt turns the counterfactual approach, which is indifferent to the logging policy, into an online approach where the algorithm decides which rankings to display.
We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods.
arXiv Detail & Related papers (2020-07-24T18:05:58Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem of estimating the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on the generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
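For the stratified multiple-logger setting summarized above in "Optimal Off-Policy Evaluation from Multiple Logging Policies", a common baseline (and only a baseline, not the efficient estimator derived in that paper) is to form a per-logger IPS estimate and combine the strata with inverse-variance weights. The sketch below illustrates that baseline on synthetic single-context data; all function names and the synthetic policies are assumptions for illustration.

```python
import numpy as np

def ips_per_logger(actions, rewards, pi_t, pi_k):
    """Plain IPS estimate and its empirical variance for one logger's stratum.
    actions/rewards: data logged under behavior policy pi_k (single context,
    to keep the illustration simple); pi_t, pi_k: action-probability arrays."""
    w = pi_t[actions] / pi_k[actions]          # importance weights
    terms = w * rewards
    return terms.mean(), terms.var(ddof=1) / len(terms)

def combine_inverse_variance(estimates, variances):
    """Minimum-variance convex combination of independent unbiased estimates."""
    weights = 1.0 / np.asarray(variances)
    weights /= weights.sum()
    return float(np.dot(weights, estimates)), weights

# Synthetic example with two loggers and three actions (illustrative only).
rng = np.random.default_rng(0)
pi_t = np.array([0.2, 0.2, 0.6])
loggers = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.2, 0.7])]
true_reward = np.array([0.2, 0.5, 0.8])

estimates, variances = [], []
for pi_k, n in zip(loggers, [2000, 1000]):
    a = rng.choice(3, size=n, p=pi_k)
    r = rng.binomial(1, true_reward[a])
    est, var = ips_per_logger(a, r, pi_t, pi_k)
    estimates.append(est)
    variances.append(var)

combined, weights = combine_inverse_variance(estimates, variances)
print("per-logger:", np.round(estimates, 3), "weights:", np.round(weights, 3))
print("combined estimate:", round(combined, 3), "true value:", round(float(pi_t @ true_reward), 3))
```

Inverse-variance weighting is the minimum-variance convex combination of independent unbiased estimates, which is why it serves as a natural reference point for the efficient multi-logger estimators discussed above.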