Strategic A/B testing via Maximum Probability-driven Two-armed Bandit
- URL: http://arxiv.org/abs/2506.22536v1
- Date: Fri, 27 Jun 2025 17:15:57 GMT
- Title: Strategic A/B testing via Maximum Probability-driven Two-armed Bandit
- Authors: Yu Zhang, Shanshan Zhao, Bokui Wan, Jinjuan Wang, Xiaodong Yan
- Abstract summary: This work proposes a maximum probability-driven two-armed bandit (TAB) process by weighting the mean volatility statistic. The implementation of permutation methods further enhances robustness and efficacy. The experimental results indicate a significant improvement in A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power.
- Score: 8.336506371247559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting a minor average treatment effect is a major challenge in large-scale applications, where even minimal improvements can have a significant economic impact. Traditional methods, reliant on normal distribution-based or expanded statistics, often fail to identify such minor effects because of their inability to handle small discrepancies with sufficient sensitivity. This work leverages a counterfactual outcome framework and proposes a maximum probability-driven two-armed bandit (TAB) process by weighting the mean volatility statistic, which controls Type I error. The implementation of permutation methods further enhances robustness and efficacy. The established strategic central limit theorem (SCLT) demonstrates that our approach yields a more concentrated distribution under the null hypothesis and a less concentrated one under the alternative hypothesis, greatly improving statistical power. The experimental results indicate a significant improvement in A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power.
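As a rough illustration of the shape of such a procedure, here is a minimal sketch that pairs a greedy play-the-winner allocation with a plain difference-in-means permutation test. The allocation rule, test statistic, and arm probabilities are illustrative assumptions; the paper's mean-volatility weighting and SCLT-based analysis are not reproduced.

```python
# A minimal sketch: adaptive two-armed allocation followed by a
# permutation test. Illustrative stand-in, not the authors' TAB statistic.
import numpy as np

rng = np.random.default_rng(0)

def run_tab(arms, horizon):
    """After one warm-start pull per arm, repeatedly pull the arm with
    the higher running mean and record its outcome."""
    outcomes = [[], []]
    for t in range(horizon):
        if t < 2:
            k = t  # warm start: one pull of each arm
        else:
            k = int(np.argmax([np.mean(o) for o in outcomes]))
        outcomes[k].append(arms[k]())
    return outcomes

def permutation_pvalue(x, y, n_perm=2000):
    """Two-sided permutation p-value for the difference in means."""
    pooled = np.concatenate([x, y])
    observed = abs(np.mean(x) - np.mean(y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(np.mean(pooled[:len(x)]) - np.mean(pooled[len(x):])) >= observed
    return (hits + 1) / (n_perm + 1)

# Example: arm B carries a minor 2-point lift in conversion probability.
a, b = run_tab([lambda: rng.binomial(1, 0.50), lambda: rng.binomial(1, 0.52)], 5000)
print(permutation_pvalue(np.array(a), np.array(b)))
```

Permutation inference on adaptively collected data is delicate in general; the abstract's claim is precisely that the proposed statistic keeps Type I error controlled under such an allocation.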
Related papers
- Efficient and Scalable Estimation of Distributional Treatment Effects with Multi-Task Neural Networks [5.895315872876525]
We propose a novel multi-task neural network approach for estimating distributional treatment effects in randomized experiments. We apply our method to both simulated and real-world datasets, including a randomized field experiment aimed at reducing water consumption in the US and a large-scale A/B test from a leading streaming platform in Japan.
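As a rough illustration of the multi-task idea (the architecture below is an assumption, not the authors' model), a shared trunk can feed one quantile head per arm, and distributional treatment effects are then read off as differences between the predicted quantile curves.

```python
# A minimal sketch, assuming a quantile-regression formulation: shared
# trunk, one head per arm, trained with the pinball (quantile) loss.
import torch
import torch.nn as nn

class MultiTaskDTE(nn.Module):
    def __init__(self, d_in, n_quantiles=9):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head_control = nn.Linear(64, n_quantiles)
        self.head_treated = nn.Linear(64, n_quantiles)

    def forward(self, x):
        h = self.trunk(x)
        return self.head_control(h), self.head_treated(h)

def pinball_loss(pred, y, taus):
    """Quantile (pinball) loss; taus holds the target quantile levels."""
    err = y.unsqueeze(-1) - pred
    return torch.mean(torch.maximum(taus * err, (taus - 1) * err))
```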
arXiv Detail & Related papers (2025-07-10T13:16:33Z)
- Practical Improvements of A/B Testing with Off-Policy Estimation [51.25970890274447]
We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
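For context, the two classical off-policy value estimators such work typically benchmarks against look as follows; the paper's unbiased lower-variance family is not reproduced here.

```python
# A minimal sketch of standard inverse propensity scoring (IPS) and its
# self-normalized variant, a classical variance-reduction trick.
import numpy as np

def ips(rewards, target_probs, logging_probs):
    w = target_probs / logging_probs        # importance weights
    return np.mean(w * rewards)             # unbiased given full support

def snips(rewards, target_probs, logging_probs):
    w = target_probs / logging_probs
    return np.sum(w * rewards) / np.sum(w)  # slightly biased, lower variance
```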
arXiv Detail & Related papers (2025-06-12T13:11:01Z)
- A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation [55.53426007439564]
Estimating individualized treatment effects from observational data is a central challenge in causal inference. Inverse probability weighting (IPW) is a well-established solution to this problem, but its integration into modern deep learning frameworks remains limited. We propose Importance-Weighted Diffusion Distillation (IWDD), a novel generative framework that combines the pretraining of diffusion models with importance-weighted score distillation.
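The IPW building block is standard; a minimal sketch of the Horvitz-Thompson style ATE estimate follows (the diffusion-distillation component is not sketched, and propensities are assumed pre-estimated).

```python
# A minimal sketch of classical inverse probability weighting for the ATE.
import numpy as np

def ipw_ate(y, t, e):
    """y: outcomes, t: binary treatment indicators, e: propensity scores."""
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```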
arXiv Detail & Related papers (2025-05-16T17:00:52Z)
- Strength of statistical evidence for genuine tripartite nonlocality [0.0]
Recent advancements in network nonlocality have led to the concept of local operations and shared randomness-based genuine multipartite nonlocality (LOSR-GMNL).
This paper focuses on a tripartite scenario in which the goal is to exhibit correlations that are impossible in a network where each two-party subset shares bipartite resources and every party has access to unlimited shared randomness.
arXiv Detail & Related papers (2024-07-28T21:12:52Z)
- STATE: A Robust ATE Estimator of Heavy-Tailed Metrics for Variance Reduction in Online Controlled Experiments [22.32661807469984]
We develop a novel framework that integrates the Student's t-distribution with machine learning tools to fit heavy-tailed metrics.
By adopting a variational EM method to optimize the log-likelihood function, we can infer a robust solution that largely eliminates the negative impact of outliers.
Both simulations on synthetic data and long-term empirical results on the Meituan experiment platform demonstrate the effectiveness of our method.
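The core of such a fit is the classical EM update for a Student's t location-scale model; a minimal sketch with fixed degrees of freedom is below (the paper's variational EM with machine-learning regressors is richer).

```python
# A minimal sketch: EM for a Student's t location/scale fit with fixed
# degrees of freedom nu. Outliers receive small weights w, which is the
# robustness to heavy tails that such approaches exploit.
import numpy as np

def fit_student_t(x, nu=3.0, n_iter=50):
    mu, sigma2 = np.median(x), np.var(x)
    for _ in range(n_iter):
        w = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)  # E-step weights
        mu = np.sum(w * x) / np.sum(w)                  # M-step: location
        sigma2 = np.mean(w * (x - mu) ** 2)             # M-step: scale
    return mu, sigma2
```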
arXiv Detail & Related papers (2024-07-23T09:35:59Z)
- Max-Rank: Efficient Multiple Testing for Conformal Prediction [43.56898111853698]
Multiple hypothesis testing (MHT) frequently arises in scientific inquiries, and concurrent testing of multiple hypotheses inflates the risk of Type-I errors or false positives. This paper addresses MHT in the context of conformal prediction, a flexible framework for predictive uncertainty quantification. We introduce max-rank, a novel correction that exploits dependencies whilst efficiently controlling the family-wise error rate.
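A close classical relative is the max-statistic (Westfall-Young style) permutation correction, which likewise exploits dependence to control the family-wise error rate; the paper's max-rank procedure operates on ranks of conformal scores and differs in detail.

```python
# A minimal sketch of a max-statistic FWER correction: adjust each
# observed statistic against the resampled distribution of the maximum.
import numpy as np

def max_stat_adjusted_pvalues(stats, null_stats):
    """stats: (m,) observed statistics; null_stats: (B, m) statistics
    resampled under the global null."""
    null_max = null_stats.max(axis=1)
    return np.array([(1 + np.sum(null_max >= s)) / (1 + len(null_max))
                     for s in stats])
```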
arXiv Detail & Related papers (2023-11-17T22:44:22Z)
- Distribution-Free Statistical Dispersion Control for Societal Applications [16.43522470711466]
Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning.
Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range.
We propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work.
arXiv Detail & Related papers (2023-09-25T00:31:55Z)
- B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding [51.74479522965712]
We propose a meta-learner called the B-Learner, which can efficiently learn sharp bounds on the CATE function under limits on hidden confounding.
We prove its estimates are valid, sharp, efficient, and have a quasi-oracle property with respect to the constituent estimators under more general conditions than existing methods.
arXiv Detail & Related papers (2023-04-20T18:07:19Z)
- Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, in which a teacher model generates hard pseudo-labels on unlabeled data as supervisory signals.
We analyze the challenges these methods face through empirical experimental results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
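The self-training step these methods share is easy to state; a minimal sketch of hard pseudo-label filtering follows, with a hypothetical teacher_predict callable (SED's scale-equivalence and distillation machinery are not shown).

```python
# A minimal sketch of the generic pseudo-labeling step used in
# self-training; `teacher_predict` is a hypothetical callable that
# returns class probabilities of shape (n, n_classes).
import numpy as np

def pseudo_label_round(teacher_predict, unlabeled_x, threshold=0.9):
    probs = teacher_predict(unlabeled_x)
    conf, labels = probs.max(axis=1), probs.argmax(axis=1)
    keep = conf >= threshold            # keep only confident predictions
    return unlabeled_x[keep], labels[keep]
```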
arXiv Detail & Related papers (2022-03-23T07:33:37Z)
- Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
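Of HPTR's three components, the exponential mechanism is the simplest to sketch; the score function and its sensitivity are left abstract here.

```python
# A minimal sketch of the exponential mechanism: sample candidate i with
# probability proportional to exp(eps * score_i / (2 * sensitivity)).
import numpy as np

def exponential_mechanism(scores, eps, sensitivity, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    logits = eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(scores), p=p)
```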
arXiv Detail & Related papers (2021-11-12T06:36:40Z)
- Efficient Inference Without Trading-off Regret in Bandits: An Allocation Probability Test for Thompson Sampling [1.6114012813668934]
Using bandit algorithms to conduct adaptive randomised experiments can minimise regret, but it poses major challenges for statistical inference.
Recent attempts to address these challenges typically impose restrictions on the exploitative nature of the bandit (trading off regret) and require large sample sizes to ensure guarantees.
We introduce a novel hypothesis test, uniquely based on the allocation probabilities of the bandit algorithm, and without constraining its exploitative nature or requiring a minimum experimental size.
We demonstrate the regret and inferential advantages of our approach, particularly in small samples, both in extensive simulations and in a real-world experiment on mental health aspects.
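A minimal sketch of the quantity the test is built on follows: Bernoulli Thompson sampling with a Monte Carlo estimate of the allocation probability at each step. The test itself is not reproduced, and the arm probabilities are illustrative.

```python
# A minimal sketch: Bernoulli Thompson sampling with Beta posteriors,
# recording the allocation probability P(arm 1 is selected) each round.
import numpy as np

rng = np.random.default_rng(1)

def thompson_run(p_true, horizon, n_mc=1000):
    alpha, beta = np.ones(2), np.ones(2)         # Beta(1, 1) priors
    alloc_probs = []
    for _ in range(horizon):
        draws = rng.beta(alpha, beta, size=(n_mc, 2))
        alloc_probs.append(np.mean(draws[:, 1] > draws[:, 0]))
        k = int(rng.beta(alpha, beta).argmax())  # Thompson draw
        r = rng.binomial(1, p_true[k])
        alpha[k] += r
        beta[k] += 1 - r
    return alloc_probs

probs = thompson_run([0.50, 0.52], horizon=500)
print(probs[-1])   # drifts toward 1 as evidence for arm 1 accumulates
```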
arXiv Detail & Related papers (2021-10-30T01:47:14Z)
- With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
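Such power figures can be approximated by simulation; the sketch below uses an assumed mean improvement and per-sentence score spread (illustrative values, not the paper's empirical BLEU deltas).

```python
# A minimal sketch of simulation-based power: the fraction of simulated
# experiments in which a paired t-test on n per-sentence score deltas
# rejects at level alpha.
import numpy as np
from scipy import stats

def estimate_power(n=2000, mean_diff=1.0, sd=20.0, n_sim=2000, alpha=0.05):
    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(n_sim):
        deltas = rng.normal(mean_diff, sd, size=n)
        hits += stats.ttest_1samp(deltas, 0.0).pvalue < alpha
    return hits / n_sim

print(estimate_power())
```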
arXiv Detail & Related papers (2020-10-13T18:00:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.