Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments
- URL: http://arxiv.org/abs/2103.12198v1
- Date: Mon, 22 Mar 2021 22:05:18 GMT
- Title: Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments
- Authors: Joseph Jay Williams, Jacob Nogas, Nina Deliu, Hammad Shaikh, Sofia Villar, Audrey Durand, Anna Rafferty
- Abstract summary: Multi-armed bandit algorithms have been argued for decades as useful for adaptively randomized experiments.
We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes.
We show that collecting data with TS can as much as double the False Positive Rate (FPR) and the False Negative Rate (FNR).
- Score: 11.464963616709671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-armed bandit algorithms have been argued for decades as useful for
adaptively randomized experiments. In such experiments, an algorithm varies
which arms (e.g. alternative interventions to help students learn) are assigned
to participants, with the goal of assigning higher-reward arms to as many
participants as possible. We applied the bandit algorithm Thompson Sampling
(TS) to run adaptive experiments in three university classes. Instructors saw
great value in trying to rapidly use data to give their students in the
experiments better arms (e.g. better explanations of a concept). Our
deployment, however, illustrated a major barrier for scientists and
practitioners to use such adaptive experiments: a lack of quantifiable insight
into how much statistical analysis of specific real-world experiments is
impacted (Pallmann et al., 2018; FDA, 2019), compared to traditional uniform
random assignment. We therefore use our case study of the ubiquitous two-arm
binary reward setting to empirically investigate the impact of using Thompson
Sampling instead of uniform random assignment. In this setting, using common
statistical hypothesis tests, we show that collecting data with TS can as much
as double the False Positive Rate (FPR; incorrectly reporting differences when
none exist) and the False Negative Rate (FNR; failing to report differences
when they exist)...
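The setting described in the abstract is easy to reproduce in simulation. The sketch below is our illustration, not the authors' code: it assigns participants to two identical Bernoulli arms either uniformly at random or by Thompson Sampling with Beta(1, 1) priors, then estimates the False Positive Rate of a chi-squared test on the resulting 2x2 outcome table. The sample size, simulation count, and choice of test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def run_experiment(p1, p2, n, uniform=False):
    """Assign n participants to two Bernoulli arms; return a 2x2 outcome table."""
    succ = np.zeros(2)  # per-arm success counts (Beta(1, 1) priors)
    fail = np.zeros(2)  # per-arm failure counts
    for _ in range(n):
        if uniform:
            arm = int(rng.integers(2))
        else:
            # Thompson Sampling: play the arm with the larger posterior draw
            arm = int(np.argmax(rng.beta(1 + succ, 1 + fail)))
        reward = rng.random() < (p1, p2)[arm]
        succ[arm] += reward
        fail[arm] += 1 - reward
    return np.stack([succ, fail], axis=1)  # rows = arms, cols = (success, failure)

def false_positive_rate(n=200, sims=2000, alpha=0.05, uniform=False):
    """Fraction of null experiments (equal arms) flagged by a chi-squared test."""
    fp = 0
    for _ in range(sims):
        table = run_experiment(0.5, 0.5, n, uniform=uniform)
        if table.sum(axis=1).min() > 0 and table.sum(axis=0).min() > 0:
            _, pval, _, _ = chi2_contingency(table)
            fp += pval < alpha
    return fp / sims

print("FPR, uniform assignment:", false_positive_rate(uniform=True))
print("FPR, Thompson Sampling: ", false_positive_rate(uniform=False))
```

Under adaptive assignment, the per-arm sample sizes become random and correlated with the observed rewards, which is what inflates the error rates of tests that assume fixed, independent samples.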
Related papers
- Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z)
- Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z)
- Assign Experiment Variants at Scale in Online Controlled Experiments [1.9205538784019935]
Online controlled experiments (A/B tests) have become the gold standard for learning the impact of new product features in technology companies.
Technology companies run A/B tests at scale -- hundreds if not thousands of A/B tests concurrently, each with millions of users.
We present a novel assignment algorithm and statistical tests to validate the randomized assignments; a generic version of this idea is sketched below.
arXiv Detail & Related papers (2022-12-17T00:45:12Z)
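A common way to implement and validate variant assignment at scale (a generic sketch, not necessarily the paper's algorithm) is deterministic hash-based bucketing, checked with a chi-squared goodness-of-fit test on bucket counts, i.e. a sample ratio mismatch test:

```python
import hashlib
from collections import Counter
from scipy.stats import chisquare

def assign_variant(user_id: str, experiment: str, variants: int = 2) -> int:
    """Deterministically hash (experiment, user) into one of `variants` buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % variants

# With a good hash, bucket counts should look like a uniform multinomial.
counts = Counter(assign_variant(f"user-{i}", "exp-42") for i in range(100_000))
stat, pval = chisquare(list(counts.values()))
print(counts, f"p={pval:.3f}")  # a tiny p-value would signal sample ratio mismatch
```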
- Using Adaptive Experiments to Rapidly Help Students [5.446351709118483]
We evaluate the effect of homework email reminders on students by conducting an adaptive experiment using the Thompson Sampling algorithm.
We raise a range of open questions about the conditions under which adaptive randomized experiments may be more or less useful.
arXiv Detail & Related papers (2022-08-10T00:43:05Z)
- Increasing Students' Engagement to Reminder Emails Through Multi-Armed Bandits [60.4933541247257]
This paper presents a real-world adaptive experiment on how students engage with instructors' weekly email reminders intended to build their time-management habits.
Using Multi-Armed Bandit (MAB) algorithms in adaptive experiments can increase students' chances of obtaining better outcomes.
We highlight problems with these adaptive algorithms, such as the possible exploitation of one arm even when there is no significant difference between arms.
arXiv Detail & Related papers (2022-08-10T00:30:52Z)
- Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments.
We present simulations for 2-arm experiments exploring two algorithms that combine the benefits of uniform randomization for statistical analysis with those of reward maximization; a simple blend of the two is sketched below.
arXiv Detail & Related papers (2021-12-15T22:11:58Z)
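One simple way to blend the two objectives (our illustration; the paper's two algorithms may differ) is to assign a fixed fraction of participants uniformly at random and the rest by Thompson Sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

def mixed_assignment(succ, fail, epsilon=0.2):
    """Uniform random with probability epsilon, Thompson Sampling otherwise.

    succ/fail are per-arm success/failure counts under Beta(1, 1) priors.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(succ)))  # uniform slice: preserves analysis
    draws = rng.beta(1 + np.asarray(succ), 1 + np.asarray(fail))
    return int(np.argmax(draws))             # TS slice: maximizes reward

arm = mixed_assignment(succ=[12, 30], fail=[8, 10])
```

The uniformly assigned slice can be analyzed with standard tests, while the TS slice steers most participants toward the better-looking arm.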
- With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
arXiv Detail & Related papers (2020-10-13T18:00:02Z)
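Power of the kind quantified in that entry can be estimated by simulation. The sketch below is generic, using a two-sample t-test on synthetic per-example scores rather than the paper's BLEU-based analysis; the effect size and test-set size are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def power_by_simulation(delta, n, sigma=1.0, sims=2000, alpha=0.05):
    """Fraction of simulated experiments where a t-test detects a true gap delta."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sigma, n)      # per-example scores, system A
        b = rng.normal(delta, sigma, n)    # per-example scores, system B
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / sims

# e.g. a per-example gap of 0.05 standard deviations on a 2000-example test set
print("estimated power:", power_by_simulation(delta=0.05, n=2000))
```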
- Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
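As a deliberately minimal illustration of Bayesian inference for a transmission rate (a toy conjugate sketch; the paper's treatment of partial observability is far richer), suppose observed cases are Poisson with rate proportional to recent infection pressure, thinned by a crude constant reporting probability:

```python
import numpy as np

def posterior_R(cases, pressure, report_prob=0.5, a0=1.0, b0=1.0):
    """Gamma-Poisson conjugate update for the transmission rate R.

    Toy model (our assumption, not the paper's framework):
      observed_cases[t] ~ Poisson(report_prob * R * pressure[t])
    With a Gamma(a0, b0) prior, the posterior for R is
      Gamma(a0 + sum(cases), b0 + report_prob * sum(pressure)).
    """
    cases = np.asarray(cases, dtype=float)
    pressure = np.asarray(pressure, dtype=float)
    a = a0 + cases.sum()
    b = b0 + (report_prob * pressure).sum()
    return a, b  # posterior mean is a / b

a, b = posterior_R(cases=[3, 5, 8, 13], pressure=[2.0, 4.0, 6.5, 10.0])
print("posterior mean R:", a / b)
```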
- Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)
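In the assumption-free spirit of that last entry, a permutation test is the textbook way to compare two samples of pairwise-comparison outcomes without distributional assumptions. This is a generic sketch on a single item pair, not the paper's test, which handles richer comparison structure:

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_test(x, y, sims=10_000):
    """Permutation two-sample test on binary comparison outcomes.

    x, y: arrays of 0/1 outcomes ("item A beat item B") from two populations,
    e.g. ratings-based vs. rankings-based comparisons. Tests the null that
    both populations produce the same win probability.
    """
    x, y = np.asarray(x), np.asarray(y)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(sims):
        rng.shuffle(pooled)  # relabel under the null and recompute the gap
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= observed
    return count / sims

x = rng.random(300) < 0.55  # population 1: A wins 55% of comparisons
y = rng.random(300) < 0.62  # population 2: A wins 62% of comparisons
print("permutation p-value:", permutation_test(x, y))
```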