Pervasive Label Errors in Test Sets Destabilize Machine Learning
Benchmarks
- URL: http://arxiv.org/abs/2103.14749v1
- Date: Fri, 26 Mar 2021 21:54:36 GMT
- Title: Pervasive Label Errors in Test Sets Destabilize Machine Learning
Benchmarks
- Authors: Curtis G. Northcutt, Anish Athalye, Jonas Mueller
- Abstract summary: We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets.
We estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set.
- Score: 12.992191397900806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify label errors in the test sets of 10 of the most commonly-used
computer vision, natural language, and audio datasets, and subsequently study
the potential for these label errors to affect benchmark results. Errors in
test sets are numerous and widespread: we estimate an average of 3.4% errors
across the 10 datasets, where for example 2916 label errors comprise 6% of the
ImageNet validation set. Putative label errors are identified using confident
learning algorithms and then human-validated via crowdsourcing (54% of the
algorithmically-flagged candidates are indeed erroneously labeled).
Traditionally, machine learning practitioners choose which model to deploy
based on test accuracy - our findings advise caution here, proposing that
judging models over correctly labeled test sets may be more useful, especially
for noisy real-world datasets. Surprisingly, we find that lower capacity models
may be practically more useful than higher capacity models in real-world
datasets with high proportions of erroneously labeled data. For example, on
ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the
prevalence of originally mislabeled test examples increases by just 6%. On
CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of
originally mislabeled test examples increases by just 5%.
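As a rough illustration of the workflow the abstract describes, the sketch below flags putative test-set label errors with confident learning and then re-ranks two models of different capacity on the unflagged examples. It is a minimal sketch, assuming the open-source cleanlab library (which implements confident learning) in its 2.x interface; the .npy paths, prediction arrays, and model names are placeholders, not artifacts of the paper.

```python
# Minimal sketch (not the paper's exact pipeline): flag likely label errors
# in a test set with confident learning, then compare two models on the
# subset whose labels were not flagged. Assumes cleanlab >= 2.0; all file
# paths and prediction arrays below are placeholders.
import numpy as np
from cleanlab.filter import find_label_issues

# Out-of-sample predicted probabilities (n_examples x n_classes) from any
# trained classifier, plus the given (possibly noisy) test labels.
pred_probs = np.load("test_pred_probs.npy")   # placeholder path
test_labels = np.load("test_labels.npy")      # placeholder path

# Boolean mask of putative label errors; in the paper such candidates were
# additionally human-validated via crowdsourcing before labels were corrected.
issue_mask = find_label_issues(labels=test_labels, pred_probs=pred_probs)
print(f"flagged {issue_mask.sum()} of {len(test_labels)} test examples")

# Re-rank two models of different capacity on the unflagged examples.
preds_small = np.load("resnet18_test_preds.npy")  # placeholder path
preds_large = np.load("resnet50_test_preds.npy")  # placeholder path
keep = ~issue_mask
for name, preds in [("ResNet-18", preds_small), ("ResNet-50", preds_large)]:
    acc = float((preds[keep] == test_labels[keep]).mean())
    print(f"{name}: accuracy on unflagged test examples = {acc:.4f}")
```

Whether the lower- or higher-capacity model wins on the cleaned subset is exactly the kind of ranking change the abstract reports.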
Related papers
- R+R: Security Vulnerability Dataset Quality Is Critical [0.6906005491572401]
A number of studies have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples.
Our findings indicate that 56% of the samples had incorrect labels and 44% comprised incomplete samples--only 31% were both accurate and complete.
We employ transfer learning using a large deduplicated bugfix corpus to show that these models can exhibit better performance if given larger amounts of high-quality pre-training data.
arXiv Detail & Related papers (2025-03-09T01:49:30Z)
- Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting [55.361337202198925]
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
We propose a label-free prompt distribution learning and bias correction framework, dubbed Frolic, which boosts zero-shot performance without the need for labeled data.
arXiv Detail & Related papers (2024-10-25T04:00:45Z)
- Improving Label Error Detection and Elimination with Uncertainty Quantification [5.184615738004059]
We develop novel, model-agnostic algorithms for Uncertainty Quantification-Based Label Error Detection (UQ-LED).
Our UQ-LED algorithms outperform state-of-the-art confident learning in identifying label errors.
We propose a novel approach to generate realistic, class-dependent label errors synthetically.
arXiv Detail & Related papers (2024-05-15T15:17:52Z)
- AQuA: A Benchmarking Tool for Label Quality Assessment [16.83510474053401]
Recent studies have found datasets widely used to train and evaluate machine learning models to have pervasive labeling errors.
We propose a benchmarking environment AQuA to rigorously evaluate methods that enable machine learning in the presence of label noise.
arXiv Detail & Related papers (2023-06-15T19:42:11Z)
- Identifying Label Errors in Object Detection Datasets by Loss Inspection [4.442111891959355]
We introduce a benchmark for label error detection methods on object detection datasets.
We simulate four different types of randomly introduced label errors on train and test sets of well-labeled object detection datasets.
arXiv Detail & Related papers (2023-03-13T10:54:52Z)
- CTRL: Clustering Training Losses for Label Error Detection [4.49681473359251]
In supervised machine learning, the use of correct labels is extremely important to ensure high accuracy.
We propose a novel framework, called CTRL (Clustering TRaining Losses), for label error detection.
It detects label errors in two steps, based on the observation that models learn clean and noisy labels in different ways (a simplified loss-clustering sketch appears after this list).
arXiv Detail & Related papers (2022-08-17T18:09:19Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently-distinct features by seeking inconsistencies in predictions.
We show that inconsistencies in these predictions help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework, which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process (a selection-rule sketch appears after this list).
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Big Self-Supervised Models are Strong Semi-Supervised Learners [116.00752519907725]
We show that it is surprisingly effective for semi-supervised learning on ImageNet.
A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning.
We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network.
arXiv Detail & Related papers (2020-06-17T17:48:22Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest and most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: Have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
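For the CTRL entry above, here is a deliberately simplified sketch of the underlying observation (models tend to fit clean labels earlier and more easily than noisy ones), not the authors' exact two-step algorithm. The loss_history array, file path, and cluster-assignment rule are assumptions for illustration only.

```python
# Simplified illustration of the idea behind CTRL: per-example training-loss
# trajectories of clean and noisy labels tend to separate, so clustering the
# trajectories can flag likely label errors. Not the paper's exact procedure.
import numpy as np
from sklearn.cluster import KMeans

def flag_by_loss_clustering(loss_history: np.ndarray) -> np.ndarray:
    """Return a boolean mask of examples whose loss trajectory looks noisy.

    loss_history: (n_examples x n_epochs) array of per-example training losses
    recorded during training (an assumed input, not provided by the paper).
    """
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(loss_history)
    # Heuristic: the cluster with the higher mean loss is treated as the one
    # containing label errors.
    cluster_means = [loss_history[km.labels_ == c].mean() for c in (0, 1)]
    noisy_cluster = int(np.argmax(cluster_means))
    return km.labels_ == noisy_cluster

loss_history = np.load("per_example_loss_by_epoch.npy")  # placeholder path
suspect = flag_by_loss_clustering(loss_history)
print(f"{suspect.sum()} examples flagged as possible label errors")
```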
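For the uncertainty-aware pseudo-label selection (UPS) entry, a hedged sketch of a selection rule in the same spirit: keep a pseudo-label only if the prediction is confident and low-variance across stochastic forward passes (e.g., MC dropout or an ensemble). The thresholds, array shapes, and file path are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of uncertainty-aware pseudo-label selection: combine a
# confidence threshold with a variance-based uncertainty threshold computed
# over several stochastic forward passes on unlabeled data.
import numpy as np

def select_pseudo_labels(
    probs_per_pass: np.ndarray,   # (n_passes, n_examples, n_classes)
    conf_threshold: float = 0.9,  # illustrative value
    unc_threshold: float = 0.05,  # illustrative value
):
    mean_probs = probs_per_pass.mean(axis=0)        # average over passes
    pseudo_labels = mean_probs.argmax(axis=1)       # hard pseudo-labels
    confidence = mean_probs.max(axis=1)
    # Uncertainty of the chosen class: std of its probability across passes.
    idx = np.arange(len(pseudo_labels))
    uncertainty = probs_per_pass[:, idx, pseudo_labels].std(axis=0)
    keep = (confidence >= conf_threshold) & (uncertainty <= unc_threshold)
    return pseudo_labels, keep

probs_per_pass = np.load("unlabeled_mc_dropout_probs.npy")  # placeholder path
labels, keep = select_pseudo_labels(probs_per_pass)
print(f"selected {keep.sum()} of {len(keep)} pseudo-labels for training")
```

The design choice here mirrors the entry's claim: filtering by uncertainty as well as confidence reduces the noise that poorly calibrated, overconfident predictions would otherwise inject into training.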
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.