A Common Misassumption in Online Experiments with Machine Learning Models
- URL: http://arxiv.org/abs/2304.10900v1
- Date: Fri, 21 Apr 2023 11:36:44 GMT
- Title: A Common Misassumption in Online Experiments with Machine Learning Models
- Authors: Olivier Jeunen
- Abstract summary: We argue that, because variants typically learn using pooled data, a lack of model interference cannot be guaranteed.
We discuss the implications this has for practitioners, and for the research literature.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests
are the bread and butter of modern platforms on the web. They are conducted
continuously to allow platforms to estimate the causal effect of replacing
system variant "A" with variant "B", on some metric of interest. These variants
can differ in many aspects. In this paper, we focus on the common use-case
where they correspond to machine learning models. The online experiment then
serves as the final arbiter to decide which model is superior, and should thus
be shipped.
The statistical literature on causal effect estimation from RCTs has a
substantial history, which contributes deservedly to the level of trust
researchers and practitioners have in this "gold standard" of evaluation
practices. Nevertheless, in the particular case of machine learning
experiments, we remark that certain critical issues remain. Specifically, the
assumptions required to guarantee that A/B-tests yield unbiased estimates of
the causal effect are seldom met in practical applications. We
argue that, because variants typically learn using pooled data, a lack of model
interference cannot be guaranteed. This undermines the conclusions we can draw
from online experiments with machine learning models. We discuss the
implications this has for practitioners, and for the research literature.
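
To make the interference mechanism concrete, below is a minimal, hypothetical simulation sketch (not from the paper): two model variants are compared in a randomised test, but both can retrain on the pooled interaction log. The environment, policies, and all parameters are illustrative assumptions.

```python
# A minimal, hypothetical simulation of interference via pooled training data.
# Everything here (environment, policies, parameters) is an illustrative
# assumption, not the paper's method.
import numpy as np

P_CLICK = np.array([0.5, 0.6])  # true click-through rates of two candidate items

def run_ab_test(pooled: bool, n_users: int = 100_000, seed: int = 0):
    """A/B-test between greedy variant 'A' and epsilon-greedy variant 'B';
    both variants retrain on pooled logs if `pooled`, else on own-arm data."""
    rng = np.random.default_rng(seed)
    # Per-variant click/impression counts, seeded with a misleading prior
    # that favours the worse item (item 0: 5/10 = 0.5 vs item 1: 0/1 = 0.0).
    clicks = {v: np.array([5.0, 0.0]) for v in "AB"}
    shows = {v: np.array([10.0, 1.0]) for v in "AB"}
    reward = {"A": 0.0, "B": 0.0}
    count = {"A": 0, "B": 0}
    for _ in range(n_users):
        v = "A" if rng.random() < 0.5 else "B"   # randomise the user over arms
        if v == "B" and rng.random() < 0.1:      # B explores 10% of the time
            item = int(rng.integers(2))
        else:                                    # otherwise act greedily
            item = int(np.argmax(clicks[v] / shows[v]))
        r = float(rng.random() < P_CLICK[item])  # observe a click or not
        reward[v] += r
        count[v] += 1
        # The key step: with pooled training, *both* variants update on every
        # logged interaction, so A free-rides on B's exploration data.
        for u in ("AB" if pooled else v):
            clicks[u][item] += r
            shows[u][item] += 1.0
    return reward["A"] / count["A"], reward["B"] / count["B"]

for pooled in (False, True):
    ctr_a, ctr_b = run_ab_test(pooled)
    print(f"pooled={pooled}: CTR(A)={ctr_a:.3f} CTR(B)={ctr_b:.3f} "
          f"estimated lift={ctr_b - ctr_a:+.3f}")
```

With isolated training, the test recovers B's genuine advantage (A never discovers the better item); with pooled training, A inherits what B learned through exploration and the estimated lift all but disappears, so the experiment no longer answers which model would perform better if shipped on its own.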
Related papers
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [61.64685376882383]
Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models.
This paper investigates the robustness of existing CLTR models in complex and diverse situations.
We find that the DLA models and IPS-DCM show better robustness under various simulation settings than IPS-PBM and PRS with offline propensity estimation.
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- Estimating Causal Effects with Double Machine Learning -- A Method Evaluation [5.904095466127043]
We review one of the most prominent methods: "double/debiased machine learning" (DML).
Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships.
When estimating the effects of air pollution on housing prices, we find that DML estimates are consistently larger than estimates of less flexible methods.
arXiv Detail & Related papers (2024-03-21T13:21:33Z)
- TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time (Extended Version) [18.146377453918724]
Malware detectors often experience performance decay due to constantly evolving operating systems and attack methods.
This paper argues that commonly reported results are inflated due to two pervasive sources of experimental bias in the detection task.
arXiv Detail & Related papers (2024-02-02T12:27:32Z)
- Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition [49.1574468325115]
Sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results (a minimal sketch of this pitfall appears after this list).
It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked.
Several experiments with different types of datasets and classification models demonstrate the problem and show that it persists regardless of the method or dataset.
arXiv Detail & Related papers (2023-10-18T13:24:05Z)
- Using Auxiliary Data to Boost Precision in the Analysis of A/B Tests on an Online Educational Platform: New Data and New Results [1.5293427903448025]
A/B tests allow causal effect estimation without confounding bias and exact statistical inference even in small samples.
Recent methodological advances have shown that power and statistical precision can be substantially boosted by coupling design-based causal estimation to machine-learning models of rich log data from historical users who were not in the experiment.
We show that the gains can be even larger for estimating subgroup effects, hold even when the remnant is unrepresentative of the A/B test sample, and extend to post-stratification population effects estimators.
arXiv Detail & Related papers (2023-06-09T21:54:36Z)
- Intervention Generalization: A View from Factor Graph Models [7.117681268784223]
We take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system.
A postulated "interventional factor model" (IFM) may not always be informative, but it conveniently abstracts away the need to explicitly model unmeasured confounding and feedback mechanisms.
arXiv Detail & Related papers (2023-06-06T21:44:23Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
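
The overestimation pitfall noted in the "Too Good To Be True" entry above is easy to reproduce. Below is a minimal, hypothetical sketch on purely synthetic data (not from that paper): overlapping sliding windows from the same recording are near-duplicates, so random k-fold cross-validation leaks test information into training, while grouping folds by recording removes the leak.

```python
# A minimal, hypothetical sketch of the sliding-window evaluation pitfall,
# on purely synthetic data; all names and parameters are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
X, y, groups = [], [], []
for rec in range(20):                       # 20 recordings, one label each
    label = rec % 2                         # labels carry no real signal
    signal = rng.normal(size=500).cumsum()  # idiosyncratic 1-D sensor trace
    for start in range(0, 400, 10):         # 90%-overlapping windows of length 100
        X.append(signal[start:start + 100])
        y.append(label)
        groups.append(rec)
X, y, groups = np.array(X), np.array(y), np.array(groups)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
# Random k-fold: near-duplicates of each test window sit in the training
# folds, so the score is inflated by memorisation of the recording.
naive = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Group k-fold by recording: no window of a test recording is seen in training.
honest = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(f"random k-fold (leaky): {naive.mean():.2f}")
print(f"group k-fold (honest): {honest.mean():.2f}")
```

Since the labels here are unrelated to the signal, the grouped score should hover near chance, while the random split scores far higher purely through leakage between overlapping windows.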
This list is automatically generated from the titles and abstracts of the papers on this site.