With Little Power Comes Great Responsibility
- URL: http://arxiv.org/abs/2010.06595v1
- Date: Tue, 13 Oct 2020 18:00:02 GMT
- Title: With Little Power Comes Great Responsibility
- Authors: Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky
- Abstract summary: Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state of the art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
- Score: 54.96675741328462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its importance to experimental design, statistical power (the
probability that, given a real effect, an experiment will reject the null
hypothesis) has largely been ignored by the NLP community. Underpowered
experiments make it more difficult to discern the difference between
statistical noise and meaningful model improvements, and increase the chances
of exaggerated findings. By meta-analyzing a set of existing NLP papers and
datasets, we characterize typical power for a variety of settings and conclude
that underpowered experiments are common in the NLP literature. In particular,
for several tasks in the popular GLUE benchmark, small test sets mean that most
attempted comparisons to state of the art models will not be adequately
powered. Similarly, based on reasonable assumptions, we find that the most
typical experimental design for human rating studies will be underpowered to
detect small model differences, of the sort that are frequently studied. For
machine translation, we find that typical test sets of 2000 sentences have
approximately 75% power to detect differences of 1 BLEU point. To improve the
situation going forward, we give an overview of best practices for power
analysis in NLP and release a series of notebooks to assist with future power
analyses.
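The released notebooks are not reproduced here. As a minimal sketch of what a simulation-based power analysis for a paired accuracy comparison might look like (all parameter values below, including the baseline accuracy, agreement rate, effect size, and test-set sizes, are illustrative assumptions rather than figures from the paper, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import binomtest


def simulated_power(n_test=1000, acc_a=0.90, effect=0.01, agree=0.95,
                    alpha=0.05, n_sims=2000, seed=0):
    """Monte Carlo estimate of power for a paired comparison of two
    classifiers on the same test set, using an exact McNemar-style test.

    All parameter values are illustrative assumptions, not figures
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    disagree = 1.0 - agree
    # Joint cell probabilities implied by the baseline accuracy, the
    # effect size, and the assumed agreement rate between the models.
    p01 = (disagree + effect) / 2.0   # new model right, baseline wrong
    p10 = (disagree - effect) / 2.0   # baseline right, new model wrong
    p11 = acc_a - p10                 # both right
    p00 = 1.0 - (p11 + p10 + p01)     # both wrong
    probs = [p11, p10, p01, p00]
    assert all(p >= 0 for p in probs), "inconsistent assumptions"

    rejections = 0
    for _ in range(n_sims):
        n11, n10, n01, n00 = rng.multinomial(n_test, probs)
        discordant = n10 + n01
        if discordant == 0:
            continue  # no discordant pairs, so no basis to reject
        # Under the null, discordant pairs split 50/50 between the models.
        p_value = binomtest(int(n01), int(discordant), 0.5).pvalue
        if p_value < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for n in (500, 1000, 5000):
        print(n, round(simulated_power(n_test=n), 3))
```

The sketch conditions on an assumed agreement rate because, in a paired design, power depends heavily on how often the two models disagree, not on the raw accuracies alone.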
Related papers
- Strength of statistical evidence for genuine tripartite nonlocality [0.0]
Recent advancements in network nonlocality have led to the concept of local operations and shared randomness-based genuine multipartite nonlocality (LOSR-GMNL)
This paper focuses on a tripartite scenario where the goal is to exhibit correlations impossible in a network where each two-party subset shares bipartite resources and every party has access to unlimited shared randomness.
arXiv Detail & Related papers (2024-07-28T21:12:52Z)
- Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples [53.95282502030541]
Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples.
We move one step forward by offering a unified explanation, from a feature learning view, for the success of both types of query criteria used in NAL.
arXiv Detail & Related papers (2024-06-06T10:38:01Z)
- Using Auxiliary Data to Boost Precision in the Analysis of A/B Tests on an Online Educational Platform: New Data and New Results [1.5293427903448025]
A/B tests allow causal effect estimation without confounding bias and exact statistical inference even in small samples.
Recent methodological advances have shown that power and statistical precision can be substantially boosted by coupling design-based causal estimation to machine-learning models of rich log data from historical users who were not in the experiment.
We show that the gains can be even larger for estimating subgroup effects, hold even when the remnant is unrepresentative of the A/B test sample, and extend to post-stratification population effects estimators.
arXiv Detail & Related papers (2023-06-09T21:54:36Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Elaborating further on the robustness metric, a model is judged to be robust only if its performance is consistently accurate across each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- An Empirical Study on the Language Modal in Visual Question Answering [31.692905677913068]
Generalization beyond in-domain experience to out-of-distribution data is of paramount significance in the AI domain.
This paper attempts to provide new insights into the influence of language modality on VQA performance.
arXiv Detail & Related papers (2023-05-17T11:56:40Z)
- Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z)
- Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold [88.83876819883653]
Through a manual classification of recent NLP research papers, we show that most work indeed follows this prototypical "square one" experimental setup.
When NLP research does go beyond the square one setup, focusing not only on accuracy but also on fairness or interpretability, it typically does so along only a single dimension.
arXiv Detail & Related papers (2022-06-20T13:04:23Z)
- Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z)
- Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments [11.464963616709671]
Multi-armed bandit algorithms have for decades been argued to be useful for adaptively randomized experiments.
We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes.
We show that collecting data with TS can as much as double the False Positive Rate (FPR) and the False Negative Rate (FNR); a toy simulation illustrating this effect appears after this list.
arXiv Detail & Related papers (2021-03-22T22:05:18Z)
- Predicting Performance for Natural Language Processing Tasks [128.34208911925424]
We build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input.
Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures.
arXiv Detail & Related papers (2020-05-02T16:02:18Z)
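The bandit-algorithm entry above notes that adaptive data collection can inflate the error rates of standard tests. As a rough, hypothetical illustration (not taken from any of the papers above), the sketch below collects data with Thompson Sampling on two identical Bernoulli arms, so every rejection of a naive two-proportion z-test is a false positive, and compares that rate against uniform random assignment; how large the inflation is depends on the assumed sample size, significance level, and priors.

```python
import numpy as np
from scipy.stats import norm


def run_experiment(n=200, p=(0.5, 0.5), adaptive=True, rng=None):
    """Collect n Bernoulli observations from two arms, either uniformly at
    random or via Thompson Sampling with Beta(1, 1) priors, then return the
    p-value of a two-proportion z-test comparing the arms."""
    rng = rng or np.random.default_rng()
    succ = np.zeros(2)
    fail = np.zeros(2)
    for _ in range(n):
        if adaptive:
            draws = rng.beta(1 + succ, 1 + fail)  # posterior samples per arm
            arm = int(np.argmax(draws))
        else:
            arm = int(rng.integers(2))            # uniform random assignment
        reward = rng.random() < p[arm]
        succ[arm] += reward
        fail[arm] += not reward
    n_arm = succ + fail
    if n_arm.min() == 0:
        return 1.0                                # one arm never pulled: cannot test
    phat = succ / n_arm
    pooled = succ.sum() / n
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_arm[0] + 1 / n_arm[1]))
    if se == 0:
        return 1.0
    z = (phat[0] - phat[1]) / se
    return 2 * norm.sf(abs(z))


def false_positive_rate(adaptive, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of simulated null experiments the z-test rejects at alpha."""
    rng = np.random.default_rng(seed)
    pvals = [run_experiment(adaptive=adaptive, rng=rng) for _ in range(n_sims)]
    return float(np.mean([pv < alpha for pv in pvals]))


if __name__ == "__main__":
    # Both arms are identical (no true effect), so any rejection is a false positive.
    print("uniform FPR:", false_positive_rate(adaptive=False))
    print("Thompson FPR:", false_positive_rate(adaptive=True))
```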
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.