Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments
- URL: http://arxiv.org/abs/2103.12198v1
- Date: Mon, 22 Mar 2021 22:05:18 GMT
- Title: Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments
- Authors: Joseph Jay Williams, Jacob Nogas, Nina Deliu, Hammad Shaikh, Sofia Villar, Audrey Durand, Anna Rafferty
- Abstract summary: Multi-armed bandit algorithms have been argued for decades as useful for adaptively randomized experiments.
We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes.
We show that collecting data with TS can as much as double the False Positive Rate (FPR) and the False Negative Rate (FNR).
- Score: 11.464963616709671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-armed bandit algorithms have been argued for decades as useful for
adaptively randomized experiments. In such experiments, an algorithm varies
which arms (e.g. alternative interventions to help students learn) are assigned
to participants, with the goal of assigning higher-reward arms to as many
participants as possible. We applied the bandit algorithm Thompson Sampling
(TS) to run adaptive experiments in three university classes. Instructors saw
great value in trying to rapidly use data to give their students in the
experiments better arms (e.g. better explanations of a concept). Our
deployment, however, illustrated a major barrier for scientists and
practitioners to use such adaptive experiments: a lack of quantifiable insight
into how much statistical analysis of specific real-world experiments is
impacted (Pallmann et al., 2018; FDA, 2019), compared to traditional uniform
random assignment. We therefore use our case study of the ubiquitous two-arm
binary reward setting to empirically investigate the impact of using Thompson
Sampling instead of uniform random assignment. In this setting, using common
statistical hypothesis tests, we show that collecting data with TS can as much
as double the False Positive Rate (FPR; incorrectly reporting differences when
none exist) and the False Negative Rate (FNR; failing to report differences
when they exist)...
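The setting described in the abstract is easy to reproduce in simulation. The sketch below is our illustration, not the authors' code: it assigns participants to two identical Bernoulli arms either uniformly at random or by Thompson Sampling with Beta(1, 1) priors, then estimates the False Positive Rate of a chi-squared test on the resulting 2x2 outcome table. The sample size, simulation count, and choice of test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def run_experiment(p1, p2, n, uniform=False):
    """Assign n participants to two Bernoulli arms; return a 2x2 outcome table."""
    succ = np.zeros(2)  # per-arm success counts (Beta(1, 1) priors)
    fail = np.zeros(2)  # per-arm failure counts
    for _ in range(n):
        if uniform:
            arm = int(rng.integers(2))
        else:
            # Thompson Sampling: play the arm with the larger posterior draw
            arm = int(np.argmax(rng.beta(1 + succ, 1 + fail)))
        reward = rng.random() < (p1, p2)[arm]
        succ[arm] += reward
        fail[arm] += 1 - reward
    return np.stack([succ, fail], axis=1)  # rows = arms, cols = (success, failure)

def false_positive_rate(n=200, sims=2000, alpha=0.05, uniform=False):
    """Fraction of null experiments (equal arms) flagged by a chi-squared test."""
    fp = 0
    for _ in range(sims):
        table = run_experiment(0.5, 0.5, n, uniform=uniform)
        if table.sum(axis=1).min() > 0 and table.sum(axis=0).min() > 0:
            _, pval, _, _ = chi2_contingency(table)
            fp += pval < alpha
    return fp / sims

print("FPR, uniform assignment:", false_positive_rate(uniform=True))
print("FPR, Thompson Sampling: ", false_positive_rate(uniform=False))
```

Under adaptive assignment, the per-arm sample sizes become random and correlated with the observed rewards, which is what inflates the error rates of tests that assume fixed, independent samples.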
Related papers
- Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z)
- Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z)
- Assign Experiment Variants at Scale in Online Controlled Experiments [1.9205538784019935]
Online controlled experiments (A/B tests) have become the gold standard for learning the impact of new product features in technology companies.
Technology companies run A/B tests at scale -- hundreds if not thousands of A/B tests concurrently, each with millions of users.
We present a novel assignment algorithm and statistical tests to validate the randomized assignments; a generic version of this idea is sketched below.
arXiv Detail & Related papers (2022-12-17T00:45:12Z)
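A common way to implement and validate variant assignment at scale (a generic sketch, not necessarily the paper's algorithm) is deterministic hash-based bucketing, checked with a chi-squared goodness-of-fit test on bucket counts, i.e. a sample ratio mismatch test:

```python
import hashlib
from collections import Counter
from scipy.stats import chisquare

def assign_variant(user_id: str, experiment: str, variants: int = 2) -> int:
    """Deterministically hash (experiment, user) into one of `variants` buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % variants

# With a good hash, bucket counts should look like a uniform multinomial.
counts = Counter(assign_variant(f"user-{i}", "exp-42") for i in range(100_000))
stat, pval = chisquare(list(counts.values()))
print(counts, f"p={pval:.3f}")  # a tiny p-value would signal sample ratio mismatch
```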
- Using Adaptive Experiments to Rapidly Help Students [5.446351709118483]
We evaluate the effect of homework email reminders on students by conducting an adaptive experiment using the Thompson Sampling algorithm.
We raise a range of open questions about the conditions under which adaptive randomized experiments may be more or less useful.
arXiv Detail & Related papers (2022-08-10T00:43:05Z)
- Increasing Students' Engagement to Reminder Emails Through Multi-Armed Bandits [60.4933541247257]
This paper presents a real-world adaptive experiment on how students engage with instructors' weekly email reminders intended to build their time-management habits.
Using Multi-Armed Bandit (MAB) algorithms in adaptive experiments can increase students' chances of obtaining better outcomes.
We highlight problems with these adaptive algorithms, such as the possible exploitation of one arm even when there is no significant difference between arms.
arXiv Detail & Related papers (2022-08-10T00:30:52Z)
- Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments.
We present simulations for 2-arm experiments exploring two algorithms that combine the benefits of uniform randomization for statistical analysis with those of reward maximization; a simple blend of the two is sketched below.
arXiv Detail & Related papers (2021-12-15T22:11:58Z)
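One simple way to blend the two objectives (our illustration; the paper's two algorithms may differ) is to assign a fixed fraction of participants uniformly at random and the rest by Thompson Sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

def mixed_assignment(succ, fail, epsilon=0.2):
    """Uniform random with probability epsilon, Thompson Sampling otherwise.

    succ/fail are per-arm success/failure counts under Beta(1, 1) priors.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(succ)))  # uniform slice: preserves analysis
    draws = rng.beta(1 + np.asarray(succ), 1 + np.asarray(fail))
    return int(np.argmax(draws))             # TS slice: maximizes reward

arm = mixed_assignment(succ=[12, 30], fail=[8, 10])
```

The uniformly assigned slice can be analyzed with standard tests, while the TS slice steers most participants toward the better-looking arm.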
- With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
arXiv Detail & Related papers (2020-10-13T18:00:02Z)
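Power of the kind quantified in that entry can be estimated by simulation. The sketch below is generic, using a two-sample t-test on synthetic per-example scores rather than the paper's BLEU-based analysis; the effect size and test-set size are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def power_by_simulation(delta, n, sigma=1.0, sims=2000, alpha=0.05):
    """Fraction of simulated experiments where a t-test detects a true gap delta."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sigma, n)      # per-example scores, system A
        b = rng.normal(delta, sigma, n)    # per-example scores, system B
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / sims

# e.g. a per-example gap of 0.05 standard deviations on a 2000-example test set
print("estimated power:", power_by_simulation(delta=0.05, n=2000))
```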
- Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
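As a deliberately minimal illustration of Bayesian inference for a transmission rate (a toy conjugate sketch; the paper's treatment of partial observability is far richer), suppose observed cases are Poisson with rate proportional to recent infection pressure, thinned by a crude constant reporting probability:

```python
import numpy as np

def posterior_R(cases, pressure, report_prob=0.5, a0=1.0, b0=1.0):
    """Gamma-Poisson conjugate update for the transmission rate R.

    Toy model (our assumption, not the paper's framework):
      observed_cases[t] ~ Poisson(report_prob * R * pressure[t])
    With a Gamma(a0, b0) prior, the posterior for R is
      Gamma(a0 + sum(cases), b0 + report_prob * sum(pressure)).
    """
    cases = np.asarray(cases, dtype=float)
    pressure = np.asarray(pressure, dtype=float)
    a = a0 + cases.sum()
    b = b0 + (report_prob * pressure).sum()
    return a, b  # posterior mean is a / b

a, b = posterior_R(cases=[3, 5, 8, 13], pressure=[2.0, 4.0, 6.5, 10.0])
print("posterior mean R:", a / b)
```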
- Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)
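In the assumption-free spirit of that last entry, a permutation test is the textbook way to compare two samples of pairwise-comparison outcomes without distributional assumptions. This is a generic sketch on a single item pair, not the paper's test, which handles richer comparison structure:

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_test(x, y, sims=10_000):
    """Permutation two-sample test on binary comparison outcomes.

    x, y: arrays of 0/1 outcomes ("item A beat item B") from two populations,
    e.g. ratings-based vs. rankings-based comparisons. Tests the null that
    both populations produce the same win probability.
    """
    x, y = np.asarray(x), np.asarray(y)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(sims):
        rng.shuffle(pooled)  # relabel under the null and recompute the gap
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= observed
    return count / sims

x = rng.random(300) < 0.55  # population 1: A wins 55% of comparisons
y = rng.random(300) < 0.62  # population 2: A wins 62% of comparisons
print("permutation p-value:", permutation_test(x, y))
```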