$t$-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing
- URL: http://arxiv.org/abs/2502.04793v1
- Date: Fri, 07 Feb 2025 09:55:24 GMT
- Title: $t$-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing
- Authors: Olivier Jeunen
- Abstract summary: A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases.
We propose a practical method to empirically assess whether the $t$-test's assumptions are met and, hence, whether the A/B-test is valid.
- Abstract: A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical $t$-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT ascertains that, as the sample size grows, the sampling distribution of the Average Treatment Effect converges to normality, making the $t$-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what "sufficiently large" entails is not straightforward. To ensure that confidence intervals maintain proper coverage and that $p$-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting $p$-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the $t$-test's assumptions are met, and the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
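The procedure described in the abstract lends itself to a compact implementation. Below is a minimal sketch in Python (assuming NumPy and SciPy; the function names and the lognormal example data are illustrative, not taken from the paper): it repeatedly draws two groups from the same outcome pool to form A/A-tests, collects Welch $t$-test $p$-values, and runs a Kolmogorov-Smirnov test of those $p$-values against the uniform distribution. A small KS $p$-value signals that the normality assumption is violated at the given sample size, flagging the A/B-test as potentially unreliable.

```python
import numpy as np
from scipy import stats

def aa_pvalues(outcomes, n_per_group, n_simulations=10_000, seed=0):
    """Collect t-test p-values from repeatedly resampled A/A-tests.

    Under the null (both groups drawn from the same pool) and a valid
    normality assumption, these p-values should be Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_simulations)
    for i in range(n_simulations):
        # Draw two "arms" from the same outcome pool: an A/A-test.
        a = rng.choice(outcomes, size=n_per_group, replace=True)
        b = rng.choice(outcomes, size=n_per_group, replace=True)
        # Welch's t-test on two identically distributed groups.
        p_values[i] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return p_values

def t_test_assumptions_hold(outcomes, n_per_group, alpha=0.05):
    """Kolmogorov-Smirnov test of the A/A p-values against Uniform(0, 1)."""
    ks_stat, ks_pvalue = stats.kstest(aa_pvalues(outcomes, n_per_group),
                                      "uniform")
    return ks_pvalue >= alpha  # True: no evidence against the t-test's validity

# Heavily skewed, revenue-like outcomes: the kind of scenario prone to
# inflated Type-I errors when the per-group sample size is too small.
rng = np.random.default_rng(1)
skewed_outcomes = rng.lognormal(mean=0.0, sigma=3.0, size=100_000)
# Small per-group sizes are far more likely to fail the check than large ones.
print(t_test_assumptions_hold(skewed_outcomes, n_per_group=50))
print(t_test_assumptions_hold(skewed_outcomes, n_per_group=10_000))
```

Note that the number of A/A resamples and whether to sample with or without replacement are implementation choices made here for illustration; the paper's exact protocol may differ.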
Related papers
- An Upper Confidence Bound Approach to Estimating the Maximum Mean
We study estimation of the maximum mean using an upper confidence bound (UCB) approach.
We establish statistical guarantees, including strong consistency, mean squared errors, and central limit theorems (CLTs) for both estimators.
arXiv Detail & Related papers (2024-08-08T02:53:09Z)
- Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests
This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control.
We build an empirical Bayes solution for the problem via a greedy knapsack approach.
Our oracle decision rule is valid and optimal for large-scale tests.
arXiv Detail & Related papers (2024-07-01T07:40:08Z) - Model-free Test Time Adaptation for Out-Of-Distribution Detection [62.49795078366206]
We propose a non-parametric test-time adaptation framework for out-of-distribution detection.
The framework utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions.
We demonstrate its effectiveness through comprehensive experiments on multiple OOD detection benchmarks.
arXiv Detail & Related papers (2023-11-28T02:00:47Z) - Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity.
An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z) - A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy
Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial
Networks [3.623570119514559]
We propose a semi-Bayesian nonparametric (semi-BNP) procedure for the goodness-of-fit (GOF) test.
Our method introduces a novel Bayesian estimator for the maximum mean discrepancy (MMD) measure.
We demonstrate that our proposed test outperforms frequentist MMD-based methods, achieving lower false rejection and false acceptance rates of the null hypothesis.
arXiv Detail & Related papers (2023-03-05T10:36:21Z) - Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures.
We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z) - Cost-aware Generalized $\alpha$-investing for Multiple Hypothesis
Testing [5.521213530218833]
We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs.
This problem appears when conducting biological experiments to identify differentially expressed genes of a disease process.
We provide a theoretical analysis of the long-term behavior of $\alpha$-wealth, which motivates accounting for sample size in the $\alpha$-investing decision rule.
arXiv Detail & Related papers (2022-10-31T17:39:32Z) - Sequential Permutation Testing of Random Forest Variable Importance
Measures [68.8204255655161]
Sequential permutation tests and sequential $p$-value estimation are proposed to reduce the high computational cost of conventional permutation tests.
The results of simulation studies confirm that the theoretical properties of the sequential tests apply.
The numerical stability of the methods is investigated in two additional application studies.
arXiv Detail & Related papers (2022-06-02T20:16:50Z) - Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
These results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z) - Nonparametric Inference under B-bits Quantization [5.958064620718292]
We propose a nonparametric testing procedure based on samples quantized to $B$ bits.
In particular, we show that if $B$ exceeds a certain threshold, the proposed nonparametric testing procedure achieves the classical minimax rate of testing.
arXiv Detail & Related papers (2019-01-24T18:43:16Z)