"Medium-n studies" in computing education conferences
- URL: http://arxiv.org/abs/2311.14679v2
- Date: Tue, 28 Nov 2023 14:32:13 GMT
- Title: "Medium-n studies" in computing education conferences
- Authors: Michael Guerzhoy
- Abstract summary: We outline the considerations for when to compute and when not to compute p-values in different settings encountered by computer science education researchers.
We present summary data and make several preliminary observations about reviewer guidelines.
- Score: 4.057470201629211
- License: http://creativecommons.org/licenses/by/4.0/
Abstract: Good (Frequentist) statistical practice requires that statistical tests be
performed in order to determine whether the phenomenon being observed could
plausibly have occurred by chance if the null hypothesis were true. Good practice
also requires that a test not be performed if the study is underpowered: if the
number of observations is too small to reliably detect the hypothesized effect,
even when that effect exists. Running underpowered studies risks false-negative
results. This creates tension in the guidelines and expectations for computer
science education conferences: while expectations are clear for studies with a
large number of observations, researchers should in fact not compute p-values or
perform statistical tests if the number of observations is too small. The issue
is particularly live in computing education venues, since class sizes at which
these concerns are salient are common. We outline the
considerations for when to compute and when not to compute p-values in
different settings encountered by computer science education researchers. We
survey the author and reviewer guidelines in different computer science
education conferences (ICER, SIGCSE TS, ITiCSE, EAAI, CompEd, Koli Calling). We
present summary data and make several preliminary observations about reviewer
guidelines: guidelines vary from conference to conference; guidelines allow for
qualitative studies, and, in some cases, experience reports, but guidelines do
not generally explicitly indicate that a paper should have at least one of (1)
an appropriately-powered statistical analysis or (2) rich qualitative
descriptions. We present preliminary ideas for addressing the tension in the
guidelines between small-n and large-n studies.
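
To make the power consideration concrete, below is a minimal sketch of the kind of pre-study power check the abstract argues for, using Python's statsmodels. The effect size (Cohen's d = 0.5), the class size of 35 students per condition, and alpha = 0.05 are illustrative assumptions, not values taken from the paper.

```python
# Minimal power check for a two-group comparison (e.g., two course sections).
# The effect size, alpha, and class sizes below are illustrative assumptions only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Students per group needed to detect a "medium" effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05.
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Power actually achieved with a hypothetical class of 35 students per condition.
power_35 = analysis.power(effect_size=0.5, nobs1=35, alpha=0.05)

print(f"n per group for 80% power: {n_needed:.0f}")   # roughly 64 per group
print(f"power with 35 per group:   {power_35:.2f}")   # well below 0.8
```

When the achieved power is far below the conventional 0.8, the abstract's argument is that reporting a p-value, and in particular reading a non-significant result as evidence of no effect, is not informative; rich qualitative description is one of the alternatives the guidelines could ask for.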
Related papers
- An Audit of Machine Learning Experiments on Software Defect Prediction [1.2743036577573925]
Machine learning algorithms are widely used to predict defect-prone software components.
This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
arXiv Detail & Related papers (2026-01-26T13:31:32Z) - Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI).
We first show that conditional calibration guarantees valid PPCI at the population level.
We then introduce a sufficient representation constraint transferring validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - Ultra-imbalanced classification guided by statistical information [24.969543903532664]
We take a population-level approach to imbalanced learning by proposing a new formulation called ultra-imbalanced classification (UIC).
Under UIC, loss functions behave differently even if infinite amount of training samples are available.
A novel learning objective termed Tunable Boosting Loss is developed, which is provably resistant to data imbalance under UIC.
arXiv Detail & Related papers (2024-09-06T08:07:09Z) - Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z) - Are fairness metric scores enough to assess discrimination biases in machine learning? [4.073786857780967]
We focus on the Bios dataset, and our learning task is to predict the occupation of individuals, based on their biography.
We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets.
We then question how reliable different popular measures of bias are when the training set is only just large enough to learn reasonably accurate predictions.
arXiv Detail & Related papers (2023-06-08T15:56:57Z) - Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - Small data problems in political research: a critical replication study [5.698280399449707]
We show that the small dataset causes the classification model to be highly sensitive to variations in the random train-test split.
We also show that the applied preprocessing causes the data to be extremely sparse.
Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets cannot be maintained.
arXiv Detail & Related papers (2021-09-27T09:55:58Z) - Near-Optimal Reviewer Splitting in Two-Phase Paper Reviewing and Conference Experiment Design [76.40919326501512]
We consider the question: how should reviewers be divided between phases or conditions in order to maximize total assignment similarity?
We empirically show that across several datasets pertaining to real conference data, dividing reviewers between phases/conditions uniformly at random allows an assignment that is nearly as good as the oracle optimal assignment.
arXiv Detail & Related papers (2021-08-13T19:29:41Z) - With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point (a minimal power-simulation sketch in this spirit appears after this list).
arXiv Detail & Related papers (2020-10-13T18:00:02Z) - Marginal likelihood computation for model selection and hypothesis testing: an extensive review [66.37504201165159]
This article provides a comprehensive study of the state-of-the-art of the topic.
We highlight limitations, benefits, connections and differences among the different techniques.
Problems and possible solutions with the use of improper priors are also described.
arXiv Detail & Related papers (2020-05-17T18:31:58Z) - A Survey on Causal Inference [64.45536158710014]
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics.
Various causal effect estimation methods for observational data have sprung up.
arXiv Detail & Related papers (2020-02-05T21:35:29Z)
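
As referenced in the "With Little Power Comes Great Responsibility" entry above, and echoing the main abstract's false-negative concern, here is a minimal Monte Carlo power-simulation sketch. It uses a generic two-group comparison with an assumed effect size and assumed group sizes; it does not reproduce that paper's BLEU-based setup.

```python
# Monte Carlo estimate of statistical power for a two-sample t-test.
# The effect size, noise level, and group sizes are assumed for illustration;
# they are not taken from any of the papers above.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulated_power(n_per_group, effect_size, alpha=0.05, n_trials=2000):
    """Fraction of simulated studies whose t-test detects a true effect."""
    detections = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        _, p_value = ttest_ind(treatment, control)
        if p_value < alpha:
            detections += 1
    return detections / n_trials

for n in (10, 20, 40, 80):
    print(f"n = {n:3d} per group -> estimated power {simulated_power(n, 0.5):.2f}")
```

With small groups, most simulated studies fail to detect an effect that is genuinely present, which is the false-negative risk the main abstract warns about for underpowered, class-sized studies.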