Assessing effect sizes, variability, and power in the on-line study of language production
- URL: http://arxiv.org/abs/2403.15459v1
- Date: Tue, 19 Mar 2024 11:49:03 GMT
- Title: Assessing effect sizes, variability, and power in the on-line study of language production
- Authors: Audrey Bürki, Shravan Vasishth
- Abstract summary: We compare response time data obtained in the same word production experiment conducted in the lab and on-line.
We determine whether the two settings differ in effect sizes or in the consistency of responses over the course of the experiment.
We assess the impact of these differences on the power of the design in a series of simulations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the pandemic, many experimental psychologists and linguists have started to collect data over the internet (hereafter on-line data). The feasibility of such experiments and the sample sizes required to achieve sufficient statistical power in future experiments have to be assessed. This in turn requires information on effect sizes and variability. In a series of analyses, we compare response time data obtained in the same word production experiment conducted in the lab and on-line. These analyses allow us to determine whether the two settings differ in effect sizes, in the consistency of responses over the course of the experiment, in the variability of average response times across participants, in the magnitude of effect sizes across participants, or in the amount of unexplained variability. We assess the impact of these differences on the power of the design in a series of simulations. Our findings temper the enthusiasm raised by previous studies and suggest that on-line production studies might be feasible but at a non-negligible cost. The sample sizes required to achieve sufficient power in on-line language production studies come with a non-negligible increase in the amount of manual labour.
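The abstract describes power simulations for a two-condition word production design. As a rough illustration of how such a simulation can be set up, the sketch below generates response times with by-participant variability and estimates power as the proportion of simulated experiments in which the condition effect is detected. All parameter values (baseline RT, effect size, variance components) and the simplified by-participant analysis are assumptions for illustration only, not figures or procedures taken from the paper.

```python
# Illustrative power simulation for a two-condition response-time design.
# All numbers below are hypothetical placeholders, not estimates from the paper.
import numpy as np
from scipy.stats import ttest_rel

def simulate_power(n_participants, n_items=40, effect_ms=30.0,
                   sd_participant=80.0, sd_slope=20.0, sd_residual=150.0,
                   n_sims=2000, alpha=0.05, seed=0):
    """Proportion of simulated experiments in which the condition effect
    reaches significance in a by-participant paired comparison."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # by-participant random intercepts and random condition effects
        intercept = 600.0 + rng.normal(0.0, sd_participant, n_participants)
        slope = effect_ms + rng.normal(0.0, sd_slope, n_participants)
        # trial-level noise collapses to its standard error once averaged over items
        noise = sd_residual / np.sqrt(n_items)
        mean_a = intercept + rng.normal(0.0, noise, n_participants)
        mean_b = intercept + slope + rng.normal(0.0, noise, n_participants)
        # simple stand-in for a trial-level mixed-effects analysis
        _, p = ttest_rel(mean_b, mean_a)
        hits += p < alpha
    return hits / n_sims

if __name__ == "__main__":
    for n in (20, 40, 60, 80):
        print(f"n = {n:3d}  power ~= {simulate_power(n):.2f}")
```

Re-running the simulation with, say, a larger sd_residual or a smaller effect_ms mimics the extra noise one might expect in the on-line setting and shows how quickly the required sample size grows.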
Related papers
- Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems [3.077531983369872]
We aim to understand how varying contexts affect fairness testing outcomes.
Our results show that different context types and settings generally have a significant impact on the testing outcomes.
arXiv Detail & Related papers (2024-08-12T12:36:06Z)
- Relevant information in TDD experiment reporting [40.670930098576775]
This article aims to identify the response variable operationalization components in TDD experiments that study external quality.
The test suites, intervention types, and measurers have an influence on the measurements and results of the systematic mapping study (SMS).
The results of our SMS confirm that TDD experiments do not usually report either the test suites, the test case generation method, or the details of how external quality was measured.
arXiv Detail & Related papers (2024-06-10T15:57:56Z)
- Identification of Single-Treatment Effects in Factorial Experiments [0.0]
I show that when multiple interventions are randomized in experiments, the effect any single intervention would have outside the experimental setting is not identified absent heroic assumptions.
Observational studies and factorial experiments provide information about potential-outcome distributions with zero and multiple interventions.
I show that researchers who rely on this type of design have to justify either linearity of functional forms or specify with Directed Acyclic Graphs how variables are related in the real world.
arXiv Detail & Related papers (2024-05-16T04:01:53Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
arXiv Detail & Related papers (2023-09-13T15:42:06Z)
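The entry above defines sociodemographic prompting in a single sentence; the sketch below shows the basic idea in its simplest form, prepending a persona description to a task prompt. The profile fields and prompt template are invented for this example and are not taken from the cited paper.

```python
# Minimal illustration of sociodemographic prompting: prepend a persona
# description so the model answers "as" a person with that profile.
# Profile fields and wording are hypothetical, for illustration only.

def sociodemographic_prompt(profile: dict, question: str) -> str:
    persona = ", ".join(f"{k}: {v}" for k, v in profile.items())
    return (f"Adopt the perspective of a person with this profile ({persona}). "
            f"Answer the following question as that person would.\n\n{question}")

prompt = sociodemographic_prompt(
    {"age": "45", "gender": "female", "occupation": "teacher"},
    "Is the following tweet offensive? 'Example tweet text here.'",
)
print(prompt)
```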
- Fair Effect Attribution in Parallel Online Experiments [57.13281584606437]
A/B tests serve the purpose of reliably identifying the effect of changes introduced in online services.
It is common for online platforms to run a large number of simultaneous experiments by splitting incoming user traffic randomly.
Despite a perfect randomization between different groups, simultaneous experiments can interact with each other and create a negative impact on average population outcomes.
arXiv Detail & Related papers (2022-10-15T17:15:51Z)
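The entry above mentions splitting incoming user traffic randomly across many simultaneous experiments. A common way to do this is salted hashing of the user id per experiment, so the parallel randomizations are nominally independent. This is a generic illustration of parallel A/B assignment, not the effect-attribution method proposed in the cited paper; experiment names and the 50/50 split are arbitrary.

```python
# Generic sketch of parallel A/B assignment: each experiment hashes the user id
# with its own salt, so arm assignments across experiments are independent.
import hashlib

EXPERIMENTS = ["new_ranker", "blue_button", "shorter_onboarding"]  # hypothetical

def assign(user_id: str, experiment: str, n_arms: int = 2) -> int:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_arms

user = "user-12345"
for exp in EXPERIMENTS:
    print(exp, "-> arm", assign(user, exp))
```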
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Do Deep Neural Networks Always Perform Better When Eating More Data? [82.6459747000664]
We design experiments under Identically Independent Distribution (IID) and Out of Distribution (OOD) conditions.
Under the IID condition, the amount of information determines the effectiveness of each sample, while the contribution of samples and the differences between classes determine the amount of class information.
Under the OOD condition, the cross-domain degree of the samples determines their contributions, and the bias-fitting caused by irrelevant elements is a significant factor in the cross-domain setting.
arXiv Detail & Related papers (2022-05-30T15:40:33Z)
- On the Importance of Data Size in Probing Fine-tuned Models [18.69409646532038]
We show that the extent of encoded linguistic knowledge depends on the number of fine-tuning samples.
We show through a set of experiments that fine-tuning data size affects the recoverability of the changes made to the model's linguistic knowledge.
arXiv Detail & Related papers (2022-03-17T21:45:17Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
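The last entry mentions quantifying how the performance of the best-found model varies with the number of fine-tuning trials. One common way to compute such a curve, used here purely as an illustration (the exact procedure and numbers in the cited paper may differ), is to bootstrap the expected best validation score over k randomly chosen runs from a pool of completed runs:

```python
# Expected best validation score as a function of the number of fine-tuning
# trials, estimated by bootstrapping over a pool of completed runs.
# The scores below are synthetic; in practice they would be validation
# accuracies from runs differing only in the random seed.
import numpy as np

def expected_best_of_k(scores, k, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    draws = rng.choice(scores, size=(n_boot, k), replace=True)
    return draws.max(axis=1).mean()

rng = np.random.default_rng(1)
val_scores = rng.normal(0.88, 0.02, size=50)  # synthetic stand-in data
for k in (1, 5, 10, 25):
    print(f"k = {k:2d}  expected best ~= {expected_best_of_k(val_scores, k):.3f}")
```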
This list is automatically generated from the titles and abstracts of the papers on this site.