Practical Improvements of A/B Testing with Off-Policy Estimation
- URL: http://arxiv.org/abs/2506.10677v2
- Date: Fri, 13 Jun 2025 06:11:04 GMT
- Title: Practical Improvements of A/B Testing with Off-Policy Estimation
- Authors: Otmane Sakhi, Alexandre Gilotte, David Rohde
- Abstract summary: We introduce a family of unbiased off-policy estimators that achieve lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
- Score: 51.25970890274447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of A/B testing, a widely used protocol for evaluating the potential improvement achieved by a new decision system compared to a baseline. This protocol segments the population into two subgroups, each exposed to one version of the system, and estimates the improvement as the difference between the measured effects. In this work, we demonstrate that the commonly used difference-in-means estimator, while unbiased, can be improved. We introduce a family of unbiased off-policy estimators that achieve lower variance than the standard approach. Among this family, we identify the estimator with the lowest variance. The resulting estimator is simple and offers substantial variance reduction when the two tested systems exhibit similarities. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
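The pooled importance-sampling idea behind such estimators can be illustrated with a toy simulation (hypothetical policies and rewards; a generic sketch, not the authors' exact estimator). Both estimators below target the same true improvement, but the off-policy one lets every logged sample inform both systems:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two similar policies over 3 actions, action-dependent rewards.
pi_a = np.array([0.50, 0.30, 0.20])   # baseline system A
pi_b = np.array([0.45, 0.35, 0.20])   # candidate system B (similar to A)
reward_mean = np.array([1.0, 1.5, 0.5])
true_delta = float((pi_b - pi_a) @ reward_mean)   # 0.025

n = 200_000
group = rng.integers(0, 2, n)                     # 0 -> A, 1 -> B (50/50 split)
probs = np.where(group[:, None] == 0, pi_a, pi_b)
u = rng.random(n)
actions = np.minimum((u[:, None] > probs.cumsum(axis=1)).sum(axis=1), 2)
rewards = reward_mean[actions] + rng.normal(0.0, 0.1, n)

# Standard difference-in-means estimator: each sample informs only its own arm.
dm = float(rewards[group == 1].mean() - rewards[group == 0].mean())

# Pooled importance-sampling estimator: every sample contributes to the
# improvement estimate via weights against the 50/50 logging mixture.
mix = 0.5 * (pi_a + pi_b)
w = (pi_b[actions] - pi_a[actions]) / mix[actions]
ips = float(np.mean(w * rewards))
```

Both estimates are unbiased for the true improvement of 0.025; when the two policies are similar, the weights (pi_b - pi_a)/mix are small, which is exactly the regime where the abstract promises the largest variance reduction.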
Related papers
- Post Launch Evaluation of Policies in a High-Dimensional Setting [4.710921988115686]
A/B tests, also known as randomized controlled trials (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. This paper explores practical considerations in applying methodologies inspired by "synthetic control". Synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units.
arXiv Detail & Related papers (2024-12-30T19:35:29Z)
- Exogenous Matching: Learning Good Proposals for Tractable Counterfactual Estimation [1.9662978733004601]
We propose an importance sampling method for tractable and efficient estimation of counterfactual expressions. By minimizing a common upper bound of counterfactual estimators, we transform the variance minimization problem into a conditional distribution learning problem. We validate the theoretical results through experiments under various types and settings of Structural Causal Models (SCMs) and demonstrate superior performance on counterfactual estimation tasks.
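The core mechanism here, estimating an expectation under one distribution by reweighting samples from a proposal, can be sketched in a generic Gaussian toy (not the paper's exogenous-matching proposal):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: E_p[f(X)] with p = N(0, 1) and f(x) = x^2 (true value 1.0),
# estimated from samples drawn under a proposal q = N(0, sigma_q^2).
def is_estimate(sigma_q, n=200_000):
    x = rng.normal(0.0, sigma_q, n)
    # log importance weight: log p(x) - log q(x) for two zero-mean Gaussians
    log_w = -0.5 * x**2 + 0.5 * (x / sigma_q) ** 2 + np.log(sigma_q)
    return float(np.mean(np.exp(log_w) * x**2))

exact_proposal = is_estimate(1.0)   # proposal == target
broad_proposal = is_estimate(1.5)   # mismatched proposal: still unbiased
```

Both runs are unbiased, but their variances differ with the proposal; choosing (or learning) a good proposal is precisely the variance-minimization problem the paper addresses.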
arXiv Detail & Related papers (2024-10-17T03:08:28Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
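A baseline correction of this kind can be sketched as a control variate on a toy logged bandit (a generic construction under assumed data, not the paper's exact closed form):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy logged bandit data: logging policy mu, evaluation target pi.
mu = np.array([0.5, 0.5])
pi = np.array([0.8, 0.2])
r_mean = np.array([1.0, 0.2])
true_value = float(pi @ r_mean)   # 0.84

n = 200_000
a = rng.choice(2, size=n, p=mu)
r = r_mean[a] + rng.normal(0.0, 0.1, n)
w = pi[a] / mu[a]                 # importance weights, E[w] = 1

ips = float(np.mean(w * r))       # plain IPS estimator

# Baseline correction as a control variate: (w - 1) has zero mean, so
# mean(w*r - b*(w - 1)) is unbiased for every b; the variance-minimizing
# coefficient is b* = Cov(w*r, w) / Var(w), estimated from the sample.
b_star = np.cov(w * r, w)[0, 1] / np.var(w)
corrected = float(np.mean(w * r - b_star * (w - 1)))
```

Any choice of b leaves the estimator unbiased; only the variance changes, which is why a closed-form variance-optimal b is useful.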
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Variance Reduction in Ratio Metrics for Efficient Online Experiments [12.036747050794135]
We apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat.
Our results show that we can either improve A/B-test confidence in 77% of cases, or retain the same level of confidence with 30% fewer data points.
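One standard device for ratio metrics (a delta-method linearization on made-up per-user data; the paper's own techniques may differ) turns a ratio into per-user terms amenable to ordinary t-test machinery:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-user logs: views and clicks; the metric is a ratio
# (overall click-through rate), so user-level means do not apply directly.
views = rng.poisson(5.0, 100_000) + 1
clicks = rng.binomial(views, 0.1)

ctr = float(clicks.sum() / views.sum())

# Delta-method linearization: per-user terms whose sample mean is zero at
# the observed ratio, giving a standard-error estimate for the ratio metric.
lin = (clicks - ctr * views) / views.mean()
se = float(lin.std(ddof=1) / np.sqrt(lin.size))
```

The standard error `se` can then feed the usual two-sample comparison between experiment arms.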
arXiv Detail & Related papers (2024-01-08T18:01:09Z)
- Individualized Policy Evaluation and Learning under Clustered Network Interference [3.8601741392210434]
We consider the problem of evaluating and learning an optimal individualized treatment rule (ITR) under clustered network interference. We propose an estimator that can be used to evaluate the empirical performance of an ITR. We derive a finite-sample regret bound for a learned ITR, showing that the use of our efficient evaluation estimator leads to improved performance of learned policies.
arXiv Detail & Related papers (2023-11-04T17:58:24Z)
- Insufficiently Justified Disparate Impact: A New Criterion for Subgroup Fairness [1.9346186297861747]
We develop a new criterion, "insufficiently justified disparate impact" (IJDI).
Our novel, utility-based IJDI criterion evaluates false positive and false negative error rate imbalances.
We describe a novel IJDI-Scan approach which can efficiently identify the intersectional subpopulations.
arXiv Detail & Related papers (2023-06-19T22:10:24Z)
- Robust Bayesian Subspace Identification for Small Data Sets [91.3755431537592]
We propose regularized estimators, shrinkage estimators and Bayesian estimation to reduce the effect of variance.
Our experimental results show that our proposed estimators may reduce the estimation risk to as little as 40% of that of traditional subspace methods.
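The general shrinkage effect invoked here can be demonstrated with the classic positive-part James-Stein estimator (a generic illustration of shrinkage lowering risk, not the paper's subspace estimator):

```python
import numpy as np

rng = np.random.default_rng(4)

# Positive-part James-Stein shrinkage for a p-dimensional mean observed
# once with unit noise: shrinking toward zero lowers total squared risk.
p, trials = 50, 2000
theta = rng.normal(0.0, 0.5, p)     # true parameter vector

mle_risk = js_risk = 0.0
for _ in range(trials):
    x = rng.normal(theta, 1.0)                          # x ~ N(theta, I)
    shrink = max(0.0, 1.0 - (p - 2) / float(np.sum(x**2)))
    mle_risk += float(np.sum((x - theta) ** 2))         # unshrunk estimate
    js_risk += float(np.sum((shrink * x - theta) ** 2))
mle_risk /= trials
js_risk /= trials
```

The unshrunk estimator's risk concentrates near p = 50, while the shrunk estimator's risk is strictly lower whenever p >= 3, mirroring the risk reductions reported above.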
arXiv Detail & Related papers (2022-12-29T00:29:04Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum-variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to non-linear settings via deep learning with bias constraints.
A second motivation for BCE (the proposed bias-constrained estimator) is in applications where multiple estimates of the same unknown are averaged for improved performance.
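The linear result being extended can be checked numerically: under heteroskedastic noise with known per-sample scales, weighted least squares attains lower risk than ordinary least squares (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Heteroskedastic linear model with known per-sample noise scales: weighted
# least squares (rescaling rows by 1/sigma_i) beats unweighted OLS on risk.
n, trials = 500, 500
beta = np.array([2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
sigma = rng.uniform(0.5, 3.0, n)    # known noise scale per sample

ols_err = wls_err = 0.0
for _ in range(trials):
    y = X @ beta + sigma * rng.normal(0.0, 1.0, n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_wls = np.linalg.lstsq(X / sigma[:, None], y / sigma, rcond=None)[0]
    ols_err += float(np.sum((b_ols - beta) ** 2))
    wls_err += float(np.sum((b_wls - beta) ** 2))
```

Both estimators are unbiased; the weighting only reduces variance, which is the property the paper's bias constraints aim to preserve in non-linear settings.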
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
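The resampling machinery behind such confidence intervals can be sketched on made-up system-level scores (a generic percentile bootstrap over systems, not the paper's exact resampling scheme):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical system-level scores: a noisy automatic metric against human
# judgments for m systems; resample systems to get a CI on the correlation.
m = 30
human = rng.normal(0.0, 1.0, m)
metric = 0.6 * human + rng.normal(0.0, 0.8, m)

boot = np.empty(10_000)
for b in range(boot.size):
    idx = rng.integers(0, m, m)          # resample systems with replacement
    boot[b] = np.corrcoef(human[idx], metric[idx])[0, 1]
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Even with a genuinely correlated metric, small system counts yield wide intervals, which is the kind of uncertainty the paper quantifies.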
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Robust and flexible learning of a high-dimensional classification rule using auxiliary outcomes [2.92281985958308]
We develop a transfer learning approach to estimating a high-dimensional linear decision rule in the presence of auxiliary outcomes.
We show that the final estimator can achieve a lower estimation error than the one using only the single outcome of interest.
arXiv Detail & Related papers (2020-11-11T01:14:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.