Effects of Training Data Quality on Classifier Performance
- URL: http://arxiv.org/abs/2602.21462v1
- Date: Wed, 25 Feb 2026 00:29:51 GMT
- Title: Effects of Training Data Quality on Classifier Performance
- Authors: Alan F. Karr, Regina Ruane
- Abstract summary: We examine the effects of degrading the quality of the training data by multiple mechanisms. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers -- Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.
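The abstract's core experiment can be illustrated with a minimal sketch (not the paper's code): train classifiers on progressively degraded labels and track both accuracy and pairwise congruence, the fraction of test cases on which two classifiers agree, right or wrong. The dataset, noise mechanism, and choice of two of the four classifier families are stand-ins for illustration.

```python
# Hypothetical sketch: degrade training labels at increasing rates and
# measure accuracy and classifier congruence on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for noise in (0.0, 0.2, 0.4):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise        # flip a fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    preds = {}
    for name, clf in (("bayes", GaussianNB()),
                      ("forest", RandomForestClassifier(random_state=0))):
        preds[name] = clf.fit(X_tr, y_noisy).predict(X_te)
    acc = {k: (v == y_te).mean() for k, v in preds.items()}
    # Congruence counts agreement regardless of correctness, so it can rise
    # even as accuracy falls -- the "coincidentally correct" regime.
    congruence = (preds["bayes"] == preds["forest"]).mean()
    print(noise, acc, round(congruence, 3))
```

Under heavy degradation one would look for the breakdown the paper describes: accuracies collapsing together while congruence stays high because the classifiers err in the same way.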
Related papers
- Generative Classifiers Avoid Shortcut Solutions [84.23247217037134]
Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail under minor distribution shift. We show that generative classifiers can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks.
arXiv Detail & Related papers (2025-12-31T18:31:46Z) - Understanding Data Influence with Differential Approximation [63.817689230826595]
We introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators.
arXiv Detail & Related papers (2025-08-20T11:59:32Z) - On the Interconnections of Calibration, Quantification, and Classifier Accuracy Prediction under Dataset Shift [58.91436551466064]
This paper investigates the interconnections among three fundamental problems -- calibration, quantification, and classifier accuracy prediction -- under dataset shift conditions. We show that access to an oracle for any one of these tasks enables the resolution of the other two. We propose new methods for each problem based on direct adaptations of well-established methods borrowed from the other disciplines.
arXiv Detail & Related papers (2025-05-16T15:42:55Z) - Sub-Clustering for Class Distance Recalculation in Long-Tailed Drug Classification [3.015770349327888]
In the field of drug chemistry, certain tail classes exhibit higher identifiability during training due to their unique molecular structural features. We propose a novel method that breaks away from the traditional static evaluation paradigm based on sample size.
arXiv Detail & Related papers (2025-04-07T00:09:10Z) - Boosting of Classification Models with Human-in-the-Loop Computational Visual Knowledge Discovery [2.9465623430708905]
This paper proposes moving boosting methodology from focusing on only misclassified cases to all cases in the class overlap areas. A Divide and Classify process splits cases into simple and complex, classifying these individually through computational analysis and data visualization. After finding pure and overlap class areas, simple cases in the pure areas are classified, generating interpretable sub-models such as decision rules in propositional and first-order logics.
arXiv Detail & Related papers (2025-02-10T21:09:19Z) - PULASki: Learning inter-rater variability using statistical distances to improve probabilistic segmentation [35.34932609930401]
This work proposes the PULASki method as a computationally efficient generative tool for biomedical image segmentation. It captures variability in expert annotations, even in small datasets. Our experiments are also the first to present a comparative study of the computationally feasible segmentation of complex geometries using 3D patches and the traditional use of 2D slices.
arXiv Detail & Related papers (2023-12-25T10:31:22Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - How Nonconformity Functions and Difficulty of Datasets Impact the Efficiency of Conformal Classifiers [0.1611401281366893]
In conformal classification, the systems can output multiple class labels instead of one.
For a neural-network-based conformal classifier, the inverse-probability nonconformity function minimizes the average number of predicted labels.
We propose a successful method to combine the properties of these two nonconformity functions.
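The inverse-probability nonconformity score this entry refers to can be sketched as follows (an assumed split-conformal setup for illustration, not the paper's implementation): the score is one minus the predicted probability of the true label, calibrated on held-out data to form prediction sets at level 1 - alpha.

```python
# Hypothetical sketch of split conformal prediction with the
# inverse-probability nonconformity function: score = 1 - p_hat(y).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Calibration scores: one minus the probability assigned to the true label.
cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
alpha = 0.1
n_cal = len(cal_scores)
q = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Prediction set: every label whose nonconformity falls below the threshold.
test_scores = 1.0 - clf.predict_proba(X_te)
pred_sets = test_scores <= q                    # boolean (n_test, n_classes)
coverage = pred_sets[np.arange(len(y_te)), y_te].mean()
avg_size = pred_sets.sum(axis=1).mean()         # labels per prediction set
print(round(coverage, 3), round(avg_size, 3))
```

The average set size (`avg_size`) is the efficiency measure the entry alludes to: a better nonconformity function yields smaller prediction sets at the same coverage level.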
arXiv Detail & Related papers (2021-08-12T11:50:12Z) - The Effect of the Loss on Generalization: Empirical Study on Synthetic Lung Nodule Data [13.376247652484274]
We show that different loss functions lead to different features being learned and consequently affect the generalization ability of the classifier on unseen data.
This study provides some important insights into the design of deep learning solutions for medical imaging tasks.
arXiv Detail & Related papers (2021-08-10T17:58:01Z) - Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first asymptotically precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z) - A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers [54.996358399108566]
We investigate the performance of landmark general-purpose CNN classifiers, which achieved top results on large-scale classification datasets. We compare them against state-of-the-art fine-grained classifiers. We present an extensive evaluation on six datasets to determine whether the fine-grained classifiers are able to improve on the general-purpose baselines.
arXiv Detail & Related papers (2020-03-24T23:49:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.