When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
- URL: http://arxiv.org/abs/2512.17460v1
- Date: Fri, 19 Dec 2025 11:21:12 GMT
- Title: When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
- Authors: Emmanuel Charleson Dapaah, Jens Grabowski
- Abstract summary: We present the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues. Even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade.
- Score: 0.3867363075280543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software Defect Prediction (SDP) models are central to proactive software quality assurance, yet their effectiveness is often constrained by the quality of available datasets. Prior research has typically examined single issues such as class imbalance or feature irrelevance in isolation, overlooking that real-world data problems frequently co-occur and interact. This study presents, to our knowledge, the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues (class imbalance, class overlap, irrelevant features, attribute noise, and outliers) across 374 datasets and five classifiers. We employ Explainable Boosting Machines together with stratified interaction analysis to quantify both direct and conditional effects under default hyperparameter settings, reflecting practical baseline usage. Our results show that co-occurrence is nearly universal: even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. Irrelevant features and imbalance are nearly ubiquitous, while class overlap is the most consistently harmful issue. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade. We also uncover counterintuitive patterns, such as outliers improving performance when irrelevant features are low, underscoring the importance of context-aware evaluation. Finally, we expose a performance-robustness trade-off: no single learner dominates under all conditions. By jointly analyzing prevalence, co-occurrence, thresholds, and conditional effects, our study directly addresses a persistent gap in SDP research. In doing so, it moves beyond isolated analyses to provide a holistic, data-aware understanding of how quality issues shape model performance in real-world settings.
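The tipping points reported in the abstract can be read as a simple screening rule. The sketch below is only an illustration of that reading, not the authors' method: the abstract does not define how each issue is measured, so the function assumes the scores passed in are already on the same scales the paper uses. The names `TIPPING_POINTS` and `issues_past_tipping_point` are hypothetical, and the imbalance threshold uses the lower end of the reported 0.65-0.70 range.

```python
# Tipping points as reported in the abstract. The metric definitions are not
# given there, so input scores are ASSUMED to be on the paper's own scales.
TIPPING_POINTS = {
    "class_overlap": 0.20,
    "class_imbalance": 0.65,  # lower end of the reported 0.65-0.70 range
    "irrelevant_features": 0.94,
}


def issues_past_tipping_point(scores: dict[str, float]) -> list[str]:
    """Return, in alphabetical order, the issues whose score exceeds its tipping point."""
    return sorted(
        name
        for name, threshold in TIPPING_POINTS.items()
        if scores.get(name, 0.0) > threshold  # missing issues count as 0.0
    )


# Hypothetical scores for one dataset (made-up values, for illustration only).
example = {
    "class_overlap": 0.31,
    "class_imbalance": 0.50,
    "irrelevant_features": 0.97,
}
print(issues_past_tipping_point(example))
# → ['class_overlap', 'irrelevant_features']
```

Because the abstract also reports conditional effects (for example, outliers helping when irrelevant features are low), a per-issue check like this is at most a first-pass flag and says nothing about how co-occurring issues interact.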
Related papers
- Differential Privacy in Two-Layer Networks: How DP-SGD Harms Fairness and Robustness [2.9327666088683664]
This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private gradient.
We demonstrate that the noise required for privacy leads to suboptimal feature learning networks.
arXiv Detail & Related papers (2026-03-05T07:19:31Z)
- Can Causality Cure Confusion Caused By Correlation (in Software Analytics)? [4.082216579462797]
Symbolic models, particularly decision trees, are widely used in software engineering for explainable analytics.
Recent studies in software engineering show that both correlational models and causal discovery algorithms suffer from pronounced instability.
This study investigates integrating causality-aware split criteria into symbolic models to improve their stability and robustness.
arXiv Detail & Related papers (2026-02-17T23:35:50Z)
- Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis [23.834741751854448]
A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models.
We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient.
Under the same levels of data corruption, class-conditional diffusion models degrade catastrophically.
arXiv Detail & Related papers (2025-12-11T02:10:41Z)
- Conformal-in-the-Loop for Learning with Imbalanced Noisy Data [5.69777817429044]
Class imbalance and label noise are pervasive in large-scale datasets.
Much of machine learning research assumes well-labeled, balanced data, which rarely reflects real world conditions.
We propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach.
arXiv Detail & Related papers (2024-11-04T17:09:58Z)
- Autoencoder based approach for the mitigation of spurious correlations [2.7624021966289605]
Spurious correlations refer to erroneous associations in data that do not reflect true underlying relationships.
These correlations can lead deep neural networks (DNNs) to learn patterns that are not robust across diverse datasets or real-world scenarios.
We propose an autoencoder-based approach to analyze the nature of spurious correlations that exist in the Global Wheat Head Detection (GWHD) 2021 dataset.
arXiv Detail & Related papers (2024-06-27T05:28:44Z)
- SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems [39.675787338941184]
This paper explores the potential of synthetic data to address the data imbalance problem.
To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data.
Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples achieves impressive performance.
arXiv Detail & Related papers (2023-08-02T07:59:25Z)
- From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms [8.89658755359509]
We study how the relevance/quality and quantity of past data influence performance by analyzing a contextual Newsvendor problem.
We analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret.
arXiv Detail & Related papers (2023-02-16T17:03:39Z)
- Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- On Disentangled Representations Learned From Correlated Data [59.41587388303554]
We bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data.
We show that systematically induced correlations in the dataset are being learned and reflected in the latent representations.
We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
arXiv Detail & Related papers (2020-06-14T12:47:34Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z)