"Medium-n studies" in computing education conferences
- URL: http://arxiv.org/abs/2311.14679v2
- Date: Tue, 28 Nov 2023 14:32:13 GMT
- Title: "Medium-n studies" in computing education conferences
- Authors: Michael Guerzhoy
- Abstract summary: We outline the considerations for when to compute and when not to compute p-values in different settings encountered by computer science education researchers.
We present summary data and make several preliminary observations about reviewer guidelines.
- Score: 4.057470201629211
- License: http://creativecommons.org/licenses/by/4.0/
Abstract: Good (Frequentist) statistical practice requires that statistical tests be
performed in order to determine whether the phenomenon being observed could
plausibly have occurred by chance if the null hypothesis were true. Good practice
also requires that a test not be performed if the study is underpowered: if the
number of observations is too small to reliably detect the hypothesized effect,
even when that effect exists. Running underpowered studies risks false-negative
results. This creates tension in the guidelines and expectations for computer
science education conferences: while expectations are clear for studies with a
large number of observations, researchers should in fact not compute p-values or
perform statistical tests if the number of observations is too small. The issue
is particularly live in computing education venues, since class sizes at which
these concerns are salient are common. We outline the
considerations for when to compute and when not to compute p-values in
different settings encountered by computer science education researchers. We
survey the author and reviewer guidelines in different computer science
education conferences (ICER, SIGCSE TS, ITiCSE, EAAI, CompEd, Koli Calling). We
present summary data and make several preliminary observations about reviewer
guidelines: guidelines vary from conference to conference; guidelines allow for
qualitative studies, and, in some cases, experience reports, but guidelines do
not generally explicitly indicate that a paper should have at least one of (1)
an appropriately-powered statistical analysis or (2) rich qualitative
descriptions. We present preliminary ideas for addressing the tension in the
guidelines between small-n and large-n studies.
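
To make the power consideration concrete, below is a minimal sketch of the kind of pre-study power check the abstract argues for, using Python's statsmodels. The effect size (Cohen's d = 0.5), the class size of 35 students per condition, and alpha = 0.05 are illustrative assumptions, not values taken from the paper.

```python
# Minimal power check for a two-group comparison (e.g., two course sections).
# The effect size, alpha, and class sizes below are illustrative assumptions only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Students per group needed to detect a "medium" effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05.
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Power actually achieved with a hypothetical class of 35 students per condition.
power_35 = analysis.power(effect_size=0.5, nobs1=35, alpha=0.05)

print(f"n per group for 80% power: {n_needed:.0f}")   # roughly 64 per group
print(f"power with 35 per group:   {power_35:.2f}")   # well below 0.8
```

When the achieved power is far below the conventional 0.8, the abstract's argument is that reporting a p-value, and in particular reading a non-significant result as evidence of no effect, is not informative; rich qualitative description is one of the alternatives the guidelines could ask for.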
Related papers
- An Audit of Machine Learning Experiments on Software Defect Prediction [1.2743036577573925]
Machine learning algorithms are widely used to predict defect-prone software components.
This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
arXiv Detail & Related papers (2026-01-26T13:31:32Z) - Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI).
We first show that conditional calibration guarantees valid PPCI at the population level.
We then introduce a sufficient representation constraint transferring validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - Ultra-imbalanced classification guided by statistical information [24.969543903532664]
We take a population-level approach to imbalanced learning by proposing a new formulation called ultra-imbalanced classification (UIC).
Under UIC, loss functions behave differently even if infinite amount of training samples are available.
A novel learning objective termed Tunable Boosting Loss is developed, which is provably resistant to data imbalance under UIC.
arXiv Detail & Related papers (2024-09-06T08:07:09Z) - Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z) - Are fairness metric scores enough to assess discrimination biases in machine learning? [4.073786857780967]
We focus on the Bios dataset, and our learning task is to predict the occupation of individuals, based on their biography.
We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets.
We then question how reliable different popular measures of bias are when the training set is only just large enough to learn reasonably accurate predictions.
arXiv Detail & Related papers (2023-06-08T15:56:57Z) - Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - Small data problems in political research: a critical replication study [5.698280399449707]
We show that the small dataset causes the classification model to be highly sensitive to variations in the random train-test split.
We also show that the applied preprocessing causes the data to be extremely sparse.
Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets cannot be maintained.
arXiv Detail & Related papers (2021-09-27T09:55:58Z) - Near-Optimal Reviewer Splitting in Two-Phase Paper Reviewing and Conference Experiment Design [76.40919326501512]
We consider the question: how should reviewers be divided between phases or conditions in order to maximize total assignment similarity?
We empirically show that across several datasets pertaining to real conference data, dividing reviewers between phases/conditions uniformly at random allows an assignment that is nearly as good as the oracle optimal assignment.
arXiv Detail & Related papers (2021-08-13T19:29:41Z) - With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point (a minimal power-simulation sketch in this spirit appears after this list).
arXiv Detail & Related papers (2020-10-13T18:00:02Z) - Marginal likelihood computation for model selection and hypothesis testing: an extensive review [66.37504201165159]
This article provides a comprehensive study of the state-of-the-art of the topic.
We highlight limitations, benefits, connections and differences among the different techniques.
Problems and possible solutions with the use of improper priors are also described.
arXiv Detail & Related papers (2020-05-17T18:31:58Z) - A Survey on Causal Inference [64.45536158710014]
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics.
Various causal effect estimation methods for observational data have sprung up.
arXiv Detail & Related papers (2020-02-05T21:35:29Z)
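
As referenced in the "With Little Power Comes Great Responsibility" entry above, and echoing the main abstract's false-negative concern, here is a minimal Monte Carlo power-simulation sketch. It uses a generic two-group comparison with an assumed effect size and assumed group sizes; it does not reproduce that paper's BLEU-based setup.

```python
# Monte Carlo estimate of statistical power for a two-sample t-test.
# The effect size, noise level, and group sizes are assumed for illustration;
# they are not taken from any of the papers above.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulated_power(n_per_group, effect_size, alpha=0.05, n_trials=2000):
    """Fraction of simulated studies whose t-test detects a true effect."""
    detections = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        _, p_value = ttest_ind(treatment, control)
        if p_value < alpha:
            detections += 1
    return detections / n_trials

for n in (10, 20, 40, 80):
    print(f"n = {n:3d} per group -> estimated power {simulated_power(n, 0.5):.2f}")
```

With small groups, most simulated studies fail to detect an effect that is genuinely present, which is the false-negative risk the main abstract warns about for underpowered, class-sized studies.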