Related papers: Beyond Normality: Reliable A/B Testing with Non-Gaussian Data

URL: http://arxiv.org/abs/2510.23666v1
Date: Sun, 26 Oct 2025 14:44:19 GMT
Title: Beyond Normality: Reliable A/B Testing with Non-Gaussian Data
Authors: Junpeng Gong, Chunkai Wang, Hao Li, Jinyong Ma, Haoxuan Li, Xu He,
Abstract summary: We quantify how skewed, long-tailed data and unequal allocation distort error rates and derive explicit formulas for the minimum sample size required for the $t$-test to remain valid. We find that many online feedback metrics require hundreds of millions of samples to ensure reliable A/B testing.
Score: 15.568830806973407
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A/B testing has become the cornerstone of decision-making in online markets, guiding how platforms launch new features, optimize pricing strategies, and improve user experience. In practice, we typically employ the pairwise $t$-test to compare outcomes between the treatment and control groups, thereby assessing the effectiveness of a given strategy. To be trustworthy, these experiments must keep the Type I error (i.e., the false positive rate) under control; otherwise, we may launch harmful strategies. However, in real-world applications, we find that A/B testing often fails to deliver reliable results. When the data distribution departs from normality or when the treatment and control groups differ in sample size, the commonly used pairwise $t$-test is no longer trustworthy. In this paper, we quantify how skewed, long-tailed data and unequal allocation distort error rates and derive explicit formulas for the minimum sample size required for the $t$-test to remain valid. We find that many online feedback metrics require hundreds of millions of samples to ensure reliable A/B testing. We therefore introduce an Edgeworth-based correction that provides more accurate $p$-values when the available sample size is limited. Offline experiments on a leading A/B testing platform corroborate the practical value of our theoretical minimum sample size thresholds and demonstrate that the corrected method substantially improves the reliability of A/B testing in real-world conditions.
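
The distortion described in the abstract can be explored with a small A/A simulation. The sketch below is not the paper's method or its Edgeworth correction; it is a minimal illustration under stated assumptions: both groups are drawn from the same distribution (so every rejection is a false positive), the statistic is the unpooled (Welch-style) one, and the critical value 1.96 comes from a normal approximation. The function names, the lognormal shape parameter, and the group sizes are all hypothetical choices made for illustration.

```python
import math
import random


def mean_var(values):
    """Sample mean and unbiased sample variance of a list of floats."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def welch_statistic(x, y):
    """(mean(x) - mean(y)) divided by its unpooled standard error."""
    mx, vx = mean_var(x)
    my, vy = mean_var(y)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))


def tail_rejection_rates(draw, n_treat, n_ctrl, reps=2000, z_crit=1.96, seed=0):
    """Simulate A/A tests and return (lower-tail, upper-tail) rejection rates.

    `draw(rng)` returns one observation; both groups use the same `draw`,
    so the null hypothesis holds and each tail should reject about 2.5%
    of the time if the normal approximation is adequate.
    """
    rng = random.Random(seed)
    lower = upper = 0
    for _ in range(reps):
        x = [draw(rng) for _ in range(n_treat)]
        y = [draw(rng) for _ in range(n_ctrl)]
        z = welch_statistic(x, y)
        lower += z < -z_crit
        upper += z > z_crit
    return lower / reps, upper / reps


if __name__ == "__main__":
    # Baseline: Gaussian data, equal allocation.
    print(tail_rejection_rates(lambda r: r.gauss(0.0, 1.0), 200, 200))
    # Assumed stress case: skewed lognormal data, 1:10 allocation.
    print(tail_rejection_rates(lambda r: r.lognormvariate(0.0, 1.5), 50, 500))
```

Comparing the two printed pairs against the nominal 2.5% per tail gives a rough sense of how skewness and unequal allocation can unbalance the error rates; the exact numbers depend on the assumed distribution, sample sizes, and seed.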

Related papers

$t$-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing [3.988614978933934]
A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. We propose a practical method to test whether the $t$-test's assumptions are met, and the A/B-test is valid. This provides an efficient and effective way to empirically assess whether the $t$-test's assumptions are met, and the A/B-test is valid.
arXiv Detail & Related papers (2025-02-07T09:55:24Z)
DOTA: Distributional Test-Time Adaptation of Vision-Language Models [69.41389326333771]
Vision-language foundation models can be unreliable when significant distribution gaps exist between training and test data. We propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment.
arXiv Detail & Related papers (2024-09-28T15:03:28Z)
MedBN: Robust Test-Time Adaptation against Malicious Test Samples [11.397666167665484]
Test-time adaptation (TTA) has emerged as a promising solution to address performance decay due to unforeseen distribution shifts between training and test data. Previous studies have uncovered security vulnerabilities within TTA even when a small proportion of the test batch is maliciously manipulated. We propose median batch normalization (MedBN), leveraging the robustness of the median for statistics estimation within the batch normalization layer during test-time inference.
arXiv Detail & Related papers (2024-03-28T11:33:02Z)
Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [65.21599711087538]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications. We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z)
Variance Reduction in Ratio Metrics for Efficient Online Experiments [12.036747050794135]
We apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat. Our results show that we can either improve A/B-test confidence in 77% of cases, or can retain the same level of confidence with 30% fewer data points.
arXiv Detail & Related papers (2024-01-08T18:01:09Z)
Model-free Test Time Adaptation for Out-Of-Distribution Detection [62.49795078366206]
We propose a Non-Parametric Test Time Adaptation framework for Distribution Detection. The framework utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. We demonstrate the effectiveness of the framework through comprehensive experiments on multiple OOD detection benchmarks.
arXiv Detail & Related papers (2023-11-28T02:00:47Z)
Using Auxiliary Data to Boost Precision in the Analysis of A/B Tests on an Online Educational Platform: New Data and New Results [1.5293427903448025]
A/B tests allow causal effect estimation without confounding bias and exact statistical inference even in small samples. Recent methodological advances have shown that power and statistical precision can be substantially boosted by coupling design-based causal estimation to machine-learning models of rich log data from historical users who were not in the experiment. We show that the gains can be even larger for estimating subgroup effects, hold even when the remnant is unrepresentative of the A/B test sample, and extend to post-stratification population effects estimators.
arXiv Detail & Related papers (2023-06-09T21:54:36Z)
Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
Learn what you can't learn: Regularized Ensembles for Transductive Out-of-distribution Detection [76.39067237772286]
We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios. This paper studies how such "hard" OOD scenarios can benefit from adjusting the detection method after observing a batch of the test data. We propose a novel method that uses an artificial labeling scheme for the test data and regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch.
arXiv Detail & Related papers (2020-12-10T16:55:13Z)
Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually. Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.