Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection
- URL: http://arxiv.org/abs/2601.10084v1
- Date: Thu, 15 Jan 2026 05:20:00 GMT
- Title: Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection
- Authors: Zan Chaudhry, Noam H. Rotenberg, Brian Caffo, Craig K. Jones, Haris I. Sair,
- Abstract summary: We motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors.
- Score: 0.5284217353503208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning classification systems are susceptible to poor performance when trained with incorrect ground truth labels, even when data is well-curated by expert annotators. As machine learning becomes more widespread, it is increasingly imperative to identify and correct mislabeling to develop more powerful models. In this work, we motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED extracts an intermediate feature space from a deep convolutional neural network, denoises the features, models the reduced manifold of each class with a multidimensional Gaussian distribution, and performs a simple likelihood ratio test to identify mislabeled samples. We show that ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods, on multiple medical imaging datasets. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors, providing strong benefits to end users. The ALED detector is deployed in the Python package statlab.
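The abstract describes a four-step pipeline: extract intermediate CNN features, denoise them, fit a multivariate Gaussian per class, and run a likelihood ratio test. The following is a minimal sketch of that pipeline, not the statlab implementation: the function name `aled_sketch`, the use of PCA as the denoising step, and the zero threshold are all illustrative assumptions.

```python
# Illustrative sketch of the pipeline described in the abstract:
# (1) intermediate CNN features are given as an array, (2) denoise via PCA,
# (3) fit a multivariate Gaussian per class, (4) likelihood ratio test.
import numpy as np


def aled_sketch(features, labels, n_components=2, threshold=0.0):
    """Flag samples whose best-fitting other class beats their labeled class."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # Denoise: project onto the top principal components (PCA via SVD).
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T

    # Fit a multivariate Gaussian (mean, covariance) per class.
    classes = np.unique(labels)
    params = {}
    for c in classes:
        pts = reduced[labels == c]
        mean = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(n_components)
        params[c] = (mean, np.linalg.inv(cov), np.log(np.linalg.det(cov)))

    def log_lik(x, c):
        mean, inv_cov, log_det = params[c]
        d = x - mean
        return -0.5 * (d @ inv_cov @ d + log_det)

    flagged = []
    for i, x in enumerate(reduced):
        own = log_lik(x, labels[i])
        best_other = max(log_lik(x, c) for c in classes if c != labels[i])
        # Likelihood ratio test: flag if some other class fits much better.
        if best_other - own > threshold:
            flagged.append(i)
    return flagged
```

On two well-separated synthetic clusters with one flipped label, this sketch flags exactly the flipped sample; in practice the features would come from a trained network's intermediate layer rather than raw data.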
Related papers
- Detecting and Rectifying Noisy Labels: A Similarity-based Approach [4.686586017523293]
Label noise in datasets could significantly damage the performance and robustness of deep neural networks (DNNs) trained on these datasets. We propose post-hoc, model-agnostic noise detection and rectification methods utilizing the penultimate feature from a DNN. Our idea is based on the observation that the similarity between the penultimate feature of a mislabeled data point and its true class data points is higher than that for data points from other classes.
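The observation above can be illustrated with a small sketch: compare each sample's penultimate feature to per-class centroids and reassign labels by similarity. The function name `detect_and_rectify`, the centroid summary, and the choice of cosine similarity are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: flag and rectify labels by comparing each penultimate feature
# to per-class feature centroids under cosine similarity.
import numpy as np


def detect_and_rectify(features, labels):
    """Return (indices of suspected noisy labels, rectified labels)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Per-class centroid of penultimate features.
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Cosine similarity of every sample to every centroid.
    fn = features / np.linalg.norm(features, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = fn @ cn.T
    rectified = classes[np.argmax(sims, axis=1)]
    noisy = np.flatnonzero(rectified != labels)
    return noisy, rectified
```

A mislabeled sample sits among its true class in feature space, so its similarity to the true class centroid exceeds that to its (incorrect) labeled class, and the argmax reassigns it.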
arXiv Detail & Related papers (2025-09-28T16:41:56Z)
- Improving Label Error Detection and Elimination with Uncertainty Quantification [5.184615738004059]
We develop novel, model-agnostic algorithms for Uncertainty Quantification-Based Label Error Detection (UQ-LED)
Our UQ-LED algorithms outperform state-of-the-art confident learning in identifying label errors.
We propose a novel approach to generate realistic, class-dependent label errors synthetically.
arXiv Detail & Related papers (2024-05-15T15:17:52Z)
- All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation [67.30502812804271]
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning.
We propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions.
arXiv Detail & Related papers (2023-05-25T08:19:31Z)
- Identifying Label Errors in Object Detection Datasets by Loss Inspection [4.442111891959355]
We introduce a benchmark for label error detection methods on object detection datasets.
We simulate four different types of randomly introduced label errors on train and test sets of well-labeled object detection datasets.
arXiv Detail & Related papers (2023-03-13T10:54:52Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- CTRL: Clustering Training Losses for Label Error Detection [4.49681473359251]
In supervised machine learning, the use of correct labels is extremely important to ensure high accuracy.
We propose a novel framework, called CTRL (Clustering TRaining Losses), for label error detection.
It detects label errors in two steps, based on the observation that models learn clean and noisy labels in different ways.
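The two-step idea can be sketched as follows: record each sample's training losses over epochs, then cluster the loss summaries and flag the high-loss cluster as likely mislabeled. The simple 1-D 2-means below is an illustrative stand-in for the paper's clustering step, and `ctrl_sketch` is a hypothetical name.

```python
# Sketch: cluster per-sample training losses into two groups; the high-loss
# group contains the samples the model failed to fit, i.e. likely noisy labels.
import numpy as np


def ctrl_sketch(loss_curves, n_iter=20):
    """loss_curves: (n_samples, n_epochs) array of per-sample training losses."""
    loss_curves = np.asarray(loss_curves, dtype=float)
    # Step 1: summarize each loss curve; clean samples converge to low loss.
    summary = loss_curves.mean(axis=1)
    # Step 2: simple 1-D 2-means, centers initialized at min and max.
    centers = np.array([summary.min(), summary.max()])
    for _ in range(n_iter):
        assign = np.abs(summary[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = summary[assign == k].mean()
    noisy_cluster = centers.argmax()
    return np.flatnonzero(assign == noisy_cluster)
```

Because networks memorize noisy labels later (and less completely) than they fit clean ones, the mean loss over training separates the two groups.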
arXiv Detail & Related papers (2022-08-17T18:09:19Z)
- Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification [5.279257531335345]
We present, for the first time, a method for detecting label errors in semantic segmentation datasets with pixel-wise labels.
Our approach is able to detect the vast majority of label errors while controlling the number of false label error detections.
arXiv Detail & Related papers (2022-07-13T10:25:23Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently distinct features by seeking inconsistencies in predictions.
We show that these inconsistencies help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- SLA$^2$P: Self-supervised Anomaly Detection with Adversarial Perturbation [77.71161225100927]
Anomaly detection is a fundamental yet challenging problem in machine learning.
We propose a novel and powerful framework, dubbed SLA$^2$P, for unsupervised anomaly detection.
arXiv Detail & Related papers (2021-11-25T03:53:43Z)
- Minimax Active Learning [61.729667575374606]
Active learning aims to develop label-efficient algorithms by querying the most representative samples to be labeled by a human annotator.
Current active learning techniques either rely on model uncertainty to select the most uncertain samples or use clustering or reconstruction to choose the most diverse set of unlabeled examples.
We develop a semi-supervised minimax entropy-based active learning algorithm that leverages both uncertainty and diversity in an adversarial manner.
arXiv Detail & Related papers (2020-12-18T19:03:40Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.