Crossover Designs in Software Engineering Experiments: Review of the State of Analysis
- URL: http://arxiv.org/abs/2408.07594v2
- Date: Tue, 07 Jan 2025 09:00:07 GMT
- Title: Crossover Designs in Software Engineering Experiments: Review of the State of Analysis
- Authors: Julian Frattini, Davide Fucci, Sira Vegas
- Abstract summary: Vegas et al. reviewed the state of practice for crossover designs in Software Engineering (SE) research. This paper reviews the state of analysis of crossover design experiments in SE publications between 2015 and 2024. Despite the explicit guidelines, only 29.5% of all threats to validity were addressed properly.
- Score: 4.076290837395956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Experimentation is an essential method for causal inference in any empirical discipline. Crossover-design experiments are common in Software Engineering (SE) research. In these, subjects apply more than one treatment in different orders. This design increases the amount of obtained data and deals with subject variability, but it introduces threats to internal validity like the learning and carryover effect. Vegas et al. reviewed the state of practice for crossover designs in SE research and provided guidelines on how to address its threats during data analysis while still harnessing its benefits. In this paper, we reflect on the impact of these guidelines and review the state of analysis of crossover design experiments in SE publications between 2015 and March 2024. To this end, by conducting a forward snowballing of the guidelines, we survey 136 publications reporting 67 crossover-design experiments and evaluate their data analysis against the provided guidelines. The results show that the validity of data analyses has improved compared to the original state of analysis. Still, despite the explicit guidelines, only 29.5% of all threats to validity were addressed properly. While the maturation and the optimal sequence threats are properly addressed in 35.8% and 38.8% of all studies in our sample, respectively, the carryover threat is only modeled in about 3% of the observed cases. The lack of adherence to the analysis guidelines threatens the validity of the conclusions drawn from crossover design experiments.
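The abstract's central finding is that period (maturation) and carryover effects are rarely modeled in the data analysis. The following is a minimal numpy-only sketch of what modeling those terms looks like in a standard AB/BA crossover: simulated data with a subject-ability component, and a linear model with treatment, period, and carryover regressors. The design, effect sizes, and variable names are illustrative assumptions, not taken from the paper or from Vegas et al.'s guidelines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AB/BA crossover: 40 subjects, 2 periods each.
# Effect sizes below are made up for illustration only.
n = 40
treat_effect, period_effect, carry_effect = 2.0, 0.5, 1.0
subject_ability = rng.normal(0, 1, n)      # between-subject variability
sequence = np.repeat([0, 1], n // 2)       # 0 = sequence AB, 1 = sequence BA

rows = []
for i in range(n):
    for period in (0, 1):
        treatment = period ^ sequence[i]   # A = 0, B = 1, order set by sequence
        # Assumed carryover: only treatment B (taken in period 1 by the
        # BA group) carries over into the second period.
        carry = int(period == 1 and sequence[i] == 1)
        y = (subject_ability[i] + treat_effect * treatment
             + period_effect * period + carry_effect * carry
             + rng.normal(0, 0.3))
        rows.append((y, treatment, period, carry))

y, t, p, c = (np.array(col, dtype=float) for col in zip(*rows))
# Design matrix: intercept, treatment, period (maturation), carryover.
X = np.column_stack([np.ones_like(y), t, p, c])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "treatment", "period", "carryover"], beta.round(2))))
```

Omitting the period and carryover columns collapses this to the naive pooled analysis the review criticizes: the treatment estimate then silently absorbs learning and residual-treatment effects. A full analysis would additionally use a mixed-effects model with a random subject intercept rather than plain least squares.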
Related papers
- An Audit of Machine Learning Experiments on Software Defect Prediction [1.2743036577573925]
Machine learning algorithms are widely used to predict defect-prone software components. This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
arXiv Detail & Related papers (2026-01-26T13:31:32Z) - Exploring the Garden of Forking Paths in Empirical Software Engineering Research: A Multiverse Analysis [3.6324565773746147]
We conduct a so-called multiverse analysis on a published empirical SE paper. We identify nine pivotal analytical decisions with at least one equally defensible alternative. The overwhelming majority produced qualitatively different, and sometimes even opposite, findings.
arXiv Detail & Related papers (2025-12-09T18:47:00Z) - Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? How severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z) - (Mis)Fitting: A Survey of Scaling Laws [52.598843243928584]
We discuss discrepancies in the conclusions that several prior works reach, on questions such as the optimal token to parameter ratio.
We survey over 50 papers that study scaling trends.
We propose a checklist for authors to consider while contributing to scaling law research.
arXiv Detail & Related papers (2025-02-26T09:27:54Z) - Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI). We first show that conditional calibration guarantees valid PPCI at population level. We then introduce a sufficient representation constraint transferring validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - Mitigating Omitted Variable Bias in Empirical Software Engineering [4.389150156866014]
Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. It presents a significant threat to the validity of empirical research.
This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering.
arXiv Detail & Related papers (2025-01-28T15:43:46Z) - A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software Engineering [5.687882380471718]
Concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on Empirical Software Engineering.
We conducted a literature survey of 27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate.
We selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues.
arXiv Detail & Related papers (2025-01-22T09:05:01Z) - Good practices for evaluation of machine learning systems [28.2601701453212]
We discuss the main aspects involved in the design of the evaluation protocol: data selection, metric selection, and statistical significance.
We include examples taken from the speech processing field, and provide a list of common mistakes related to each aspect.
arXiv Detail & Related papers (2024-12-04T20:30:16Z) - A Second Look at the Impact of Passive Voice Requirements on Domain Modeling: Bayesian Reanalysis of an Experiment [4.649794383775257]
We reanalyze the only known controlled experiment investigating the impact of passive voice on the subsequent activity of domain modeling.
Our results reveal that the effects observed by the original authors turned out to be much less significant than previously assumed.
arXiv Detail & Related papers (2024-02-16T16:24:00Z) - Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization [66.4659448305396]
This study analyzes various LMs with three probing-based experiments to shed light on the reasons behind the In- vs. Cross-Topic generalization gap.
We demonstrate, for the first time, that generalization gaps and the robustness of the embedding space vary significantly across LMs.
arXiv Detail & Related papers (2024-02-02T12:59:27Z) - How Dataflow Diagrams Impact Software Security Analysis: an Empirical Experiment [5.6169596483204085]
We present the findings of an empirical experiment conducted to investigate the impact of dataflow diagrams (DFDs) on the performance of analysts in a security analysis setting.
We found that the participants performed significantly better in answering the analysis tasks correctly in the model-supported condition.
We identified three open challenges of using DFDs for security analysis based on the insights gained in the experiment.
arXiv Detail & Related papers (2024-01-09T09:22:35Z) - Ovarian Cancer Data Analysis using Deep Learning: A Systematic Review from the Perspectives of Key Features of Data Analysis and AI Assurance [0.0]
Machine or Deep Learning (ML/DL)-based autonomous data analysis tools can assist clinicians and cancer researchers in discovering patterns and relationships from complex data sets.
Many DL-based analyses on ovarian cancer (OC) data have recently been published.
However, a comprehensive understanding of these analyses in terms of these features and AI assurance (AIA) is currently lacking.
arXiv Detail & Related papers (2023-11-20T17:17:29Z) - Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition [49.1574468325115]
Sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results.
It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked.
Several experiments with different types of datasets and different types of classification models allow us to exhibit the problem and show it persists independently of the method or dataset.
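The pitfall this entry describes is that overlapping windows from the same recording end up on both sides of a random k-fold split, leaking near-duplicate samples into the test set. A common remedy is to split at the subject (or recording) level instead. Below is a minimal sketch of such a subject-wise splitter; the subject counts and variable names are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical HAR setup: windows extracted with overlap from
# per-subject recordings; subject_id[i] says which subject
# produced window i.
n_subjects, windows_per_subject = 6, 100
subject_id = np.repeat(np.arange(n_subjects), windows_per_subject)

def subject_wise_folds(subject_id, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs that keep all windows of a
    subject on the same side of the split, so overlapping windows
    cannot leak between train and test."""
    subjects = np.unique(subject_id)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    for fold_subjects in np.array_split(subjects, n_splits):
        test = np.isin(subject_id, fold_subjects)
        yield np.flatnonzero(~test), np.flatnonzero(test)

for train_idx, test_idx in subject_wise_folds(subject_id):
    # No subject appears on both sides of the split.
    assert not set(subject_id[train_idx]) & set(subject_id[test_idx])
```

scikit-learn's `GroupKFold` implements the same idea; the point is that the grouping key, not the individual window, must be the unit of cross-validation.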
arXiv Detail & Related papers (2023-10-18T13:24:05Z) - Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point.
Unlike existing methods, the proposed procedure accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z) - Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice [0.7614628596146599]
We conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks appearing in 55 papers published in premier software engineering venues.
Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings.
arXiv Detail & Related papers (2023-05-19T09:55:48Z) - Assaying Out-Of-Distribution Generalization in Transfer Learning [103.57862972967273]
We take a unified view of previous work, highlighting message discrepancies that we address empirically.
We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting.
arXiv Detail & Related papers (2022-07-19T12:52:33Z) - TraSE: Towards Tackling Authorial Style from a Cognitive Science Perspective [4.123763595394021]
Authorship attribution experiments with over 27,000 authors and 1.4 million samples in a cross-domain scenario resulted in 90% attribution accuracy.
A qualitative analysis is performed on TraSE using physical human characteristics, like age, to validate its claim on capturing cognitive traits.
arXiv Detail & Related papers (2022-06-21T19:55:07Z) - SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data [83.50281440043241]
We study the problem of inferring heterogeneous treatment effects from time-to-event data.
We propose a novel deep learning method for treatment-specific hazard estimation based on balancing representations.
arXiv Detail & Related papers (2021-10-26T20:13:17Z) - Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtle spurious correlations in training data induced by non-causal variables for prediction.
We propose a conditional independence test based algorithm to separate causal variables with a seed variable as priori, and adopt them for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.