An Audit of Machine Learning Experiments on Software Defect Prediction
- URL: http://arxiv.org/abs/2601.18477v1
- Date: Mon, 26 Jan 2026 13:31:32 GMT
- Title: An Audit of Machine Learning Experiments on Software Defect Prediction
- Authors: Giuseppe Destefanis, Leila Yousefi, Martin Shepperd, Allan Tucker, Stephen Swift, Steve Counsell, Mahir Arzoky
- Abstract summary: Machine learning algorithms are widely used to predict defect-prone software components. This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
- Score: 1.2743036577573925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Machine learning algorithms are widely used to predict defect-prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on experimental design and reporting.
Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results.
Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out-of-sample validation strategies, and the use of statistical inference. Nine study issues were evaluated. Reproducibility was assessed using the instrument proposed by González Barahona and Robles.
Results: The search identified approximately 1,585 SDP experiments published during the period. From these, we randomly sampled 101 papers, comprising 61 journal and 40 conference publications, with almost 50 percent behind paywalls. We observed substantial variation in research practice: the number of datasets ranged from 1 to 365, learners or learner variants from 1 to 34, and performance measures from 1 to 9. About 45 percent of studies applied formal statistical inference. Across the sample, we identified 427 issues, with a median of four per paper, and only one paper without issues. Reproducibility ranged from near complete to severely limited. We also identified two cases of tortured phrases and possible paper-mill activity.
Conclusions: Experimental design and reporting practices vary widely, and almost half of the studies provide insufficient detail to support reproduction. The audit indicates substantial scope for improvement.
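The design choices the audit assesses (an explicit outcome measure, an out-of-sample validation strategy, and formal statistical inference) can be made concrete. Below is a minimal Python sketch of an SDP-style experiment that exercises all three; it is not the authors' audit instrument, and the synthetic dataset and the two learners are placeholders.

```python
# Minimal sketch of the audited design choices: a stated outcome measure
# (MCC), out-of-sample validation (repeated stratified CV), and formal
# inference (Wilcoxon signed-rank test). Dataset and learners are stand-ins.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold

# Imbalanced synthetic data standing in for a defect dataset.
X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)
learners = {"rf": RandomForestClassifier(random_state=0),
            "lr": LogisticRegression(max_iter=1000)}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

scores = {name: [] for name in learners}
for train, test in cv.split(X, y):
    for name, clf in learners.items():
        clf.fit(X[train], y[train])
        scores[name].append(matthews_corrcoef(y[test], clf.predict(X[test])))

# Paired non-parametric test over the matched folds.
stat, p = wilcoxon(scores["rf"], scores["lr"])
print(f"MCC rf={np.mean(scores['rf']):.3f} lr={np.mean(scores['lr']):.3f} p={p:.4f}")
```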
Related papers
- Chasing Shadows: Pitfalls in LLM Security Research [14.334369124449346]
We identify nine common pitfalls that have become relevant with the emergence of large language models (LLMs). These pitfalls span the entire process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the present pitfalls were explicitly discussed, suggesting that the majority remain unrecognized.
arXiv Detail & Related papers (2025-12-10T11:39:09Z)
- Prediction-Powered Causal Inferences [59.98498488132307]
We focus on Prediction-Powered Causal Inferences (PPCI). We first show that conditional calibration guarantees valid PPCI at population level. We then introduce a sufficient representation constraint transferring validity across experiments.
arXiv Detail & Related papers (2025-02-10T10:52:17Z)
- "Estimating software project effort using analogies": Reflections after 28 years [0.0]
The paper examines (i) what was achieved, (ii) what has endured, and (iii) what could have been done differently with the benefit of retrospection. The original study emphasised empirical validation with benchmarks, out-of-sample testing, and data/tool sharing.
arXiv Detail & Related papers (2025-01-24T15:44:25Z)
- A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software Engineering [5.687882380471718]
Concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on Empirical Software Engineering. We conducted a literature survey of 27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate. We selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues.
arXiv Detail & Related papers (2025-01-22T09:05:01Z)
- Crossover Designs in Software Engineering Experiments: Review of the State of Analysis [4.076290837395956]
Vegas et al. reviewed the state of practice for crossover designs in Software Engineering (SE) research. This paper reviews the state of analysis of crossover design experiments in SE publications between 2015 and 2024. Despite the explicit guidelines, only 29.5% of all threats to validity were addressed properly.
arXiv Detail & Related papers (2024-08-14T14:49:25Z)
- "Medium-n studies" in computing education conferences [4.057470201629211]
We outline the considerations for when to compute and when not to compute p-values in different settings encountered by computer science education researchers.
We present summary data and make several preliminary observations about reviewer guidelines.
arXiv Detail & Related papers (2023-11-01T15:25:49Z)
- Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition [49.1574468325115]
Sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results.
It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked.
Several experiments with different types of datasets and different types of classification models allow us to exhibit the problem and show it persists independently of the method or dataset.
arXiv Detail & Related papers (2023-10-18T13:24:05Z)
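To make the pitfall described above concrete, here is a hypothetical sketch on synthetic data (illustrative only, not the paper's experiments): overlapping windows from the same subject land in both training and test folds under random k-fold, so scores are inflated relative to group-aware splitting.

```python
# Synthetic illustration: windows inherit a subject-specific offset, so
# random k-fold (windows from one subject in train AND test) looks far
# better than group-aware CV that holds out whole subjects.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_subjects, windows_per_subject = 20, 50
offsets = rng.normal(0, 2, n_subjects)  # idiosyncratic per-subject signal

X, y, groups = [], [], []
for s in range(n_subjects):
    label = s % 2
    for _ in range(windows_per_subject):
        X.append(rng.normal(offsets[s] + label, 1.0, size=8))
        y.append(label)
        groups.append(s)
X, y, groups = np.array(X), np.array(y), np.array(groups)

clf = RandomForestClassifier(random_state=0)
random_cv = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
group_cv = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=groups)
print(f"random k-fold: {random_cv.mean():.2f}  group-aware: {group_cv.mean():.2f}")
```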
- A Double Machine Learning Approach to Combining Experimental and Observational Data [58.05402364136958]
We propose a double machine learning approach to combine experimental and observational studies. Our framework provides a falsification test for external validity and ignorability under milder assumptions.
arXiv Detail & Related papers (2023-07-04T02:53:11Z)
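For orientation, the following is a hedged sketch of the generic double machine learning recipe this work builds on, namely cross-fitted partialling-out in a partially linear model; it is not the paper's combined experimental-observational estimator or its falsification test, and all data here are simulated.

```python
# Cross-fitted partialling-out (generic DML, simulated data): residualise
# treatment T and outcome Y on confounders X with flexible learners, then
# regress residual on residual to recover the effect (true value 0.5).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
T = X[:, 0] + rng.normal(size=n)                  # confounded treatment
Y = 0.5 * T + X[:, 0] ** 2 + rng.normal(size=n)   # nonlinear confounding

res_T, res_Y = np.zeros(n), np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(X):
    m_T = RandomForestRegressor(random_state=0).fit(X[train], T[train])
    m_Y = RandomForestRegressor(random_state=0).fit(X[train], Y[train])
    res_T[test] = T[test] - m_T.predict(X[test])
    res_Y[test] = Y[test] - m_Y.predict(X[test])

theta = (res_T @ res_Y) / (res_T @ res_T)  # final-stage OLS on residuals
print(f"estimated effect: {theta:.2f}")    # should be close to 0.5
```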
- The MultiBERTs: BERT Reproductions for Robustness Analysis [86.29162676103385]
Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
arXiv Detail & Related papers (2021-06-30T15:56:44Z)
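A toy sketch of the seed-level inference such checkpoint sets enable (numbers invented for illustration, not MultiBERTs results): each seed's downstream score is one observation, and conclusions are tested across seeds rather than read off a single run.

```python
# Hypothetical per-seed downstream scores for two pretraining procedures;
# the test asks whether the difference survives seed-to-seed variation.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
proc_a = rng.normal(0.880, 0.004, 25)  # one score per seed, procedure A
proc_b = rng.normal(0.884, 0.004, 25)  # one score per seed, procedure B
stat, p = ttest_ind(proc_b, proc_a)
print(f"mean diff={proc_b.mean() - proc_a.mean():.4f}, p={p:.3f}")
```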
- With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
arXiv Detail & Related papers (2020-10-13T18:00:02Z)
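A minimal simulation sketch of the kind of power analysis the paper advocates (the effect size and noise level below are placeholders, not the paper's BLEU-based estimates): power is estimated as the fraction of simulated experiments in which a paired test rejects the null.

```python
# Simulation-based power estimate: draw per-item score differences with a
# true mean improvement `delta`, run a paired (one-sample) t-test, and
# count how often it rejects at level alpha.
import numpy as np
from scipy.stats import ttest_1samp

def estimated_power(n_items=2000, delta=0.01, sigma=0.3, alpha=0.05,
                    trials=5000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        diffs = rng.normal(delta, sigma, n_items)  # per-item differences
        rejections += ttest_1samp(diffs, 0.0).pvalue < alpha
    return rejections / trials

# Power rises with n_items and delta, and falls with sigma.
print(estimated_power())
```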
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
arXiv Detail & Related papers (2020-04-28T17:59:01Z)
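To ground the claim, here is a sketch of the two estimators at issue, reconstructed from standard order statistics rather than copied from either paper: the empirical-CDF plug-in estimate of the expected best validation score after n hyperparameter trials (the variant argued to be biased) and an unbiased alternative based on drawing n of the N observed scores without replacement.

```python
# Expected maximum of n validation scores, estimated from N observed scores.
from math import comb
import numpy as np

def expected_max_unbiased(scores, n):
    # Without replacement: P(max = i-th smallest of N) = C(i-1, n-1) / C(N, n).
    v, N = np.sort(scores), len(scores)
    return sum(comb(i - 1, n - 1) / comb(N, n) * v[i - 1]
               for i in range(n, N + 1))

def expected_max_ecdf(scores, n):
    # Plug-in empirical-CDF estimator (draws WITH replacement); biased.
    v, N = np.sort(scores), len(scores)
    return sum(((i / N) ** n - ((i - 1) / N) ** n) * v[i - 1]
               for i in range(1, N + 1))

rng = np.random.default_rng(0)
scores = rng.beta(8, 2, size=50)  # stand-in validation accuracies
print(expected_max_unbiased(scores, 10), expected_max_ecdf(scores, 10))
```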