Active label cleaning: Improving dataset quality under resource
constraints
- URL: http://arxiv.org/abs/2109.00574v1
- Date: Wed, 1 Sep 2021 19:03:57 GMT
- Title: Active label cleaning: Improving dataset quality under resource
constraints
- Authors: Melanie Bernhardt, Daniel C. Castro, Ryutaro Tanno, Anton
Schwaighofer, Kerem C. Tezcan, Miguel Monteiro, Shruthi Bannur, Matthew
Lungren, Aditya Nori, Ben Glocker, Javier Alvarez-Valle, Ozan Oktay
- Abstract summary: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models.
This work advocates for a data-driven approach to prioritising samples for re-annotation.
We rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy.
- Score: 13.716577886649018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imperfections in data annotation, known as label noise, are detrimental to
the training of machine learning models and have an often-overlooked
confounding effect on the assessment of model performance. Nevertheless,
employing experts to remove label noise by fully re-annotating large datasets
is infeasible in resource-constrained settings, such as healthcare. This work
advocates for a data-driven approach to prioritising samples for re-annotation
- which we term "active label cleaning". We propose to rank instances according
to estimated label correctness and labelling difficulty of each sample, and
introduce a simulation framework to evaluate relabelling efficacy. Our
experiments on natural images and on a new medical imaging benchmark show that
cleaning noisy labels mitigates their negative impact on model training,
evaluation, and selection. Crucially, the proposed active label cleaning
enables correcting labels up to 4 times more effectively than typical random
selection in realistic conditions, making better use of experts' valuable time
for improving dataset quality.
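To make the ranking step concrete, below is a minimal sketch in the spirit of the abstract: samples whose current label disagrees most with a trained model's predicted posterior are prioritised for re-annotation, while highly ambiguous samples are deprioritised. The specific score used here (label negative log-likelihood minus predictive entropy) and all names are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def ranking_scores(posteriors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Priority score for re-annotation: higher = relabel sooner.

    posteriors: (N, C) predicted class probabilities from a trained model.
    labels:     (N,) current, possibly noisy, integer labels.
    """
    eps = 1e-12
    # Estimated label incorrectness: negative log-likelihood of the
    # current label under the model's posterior.
    label_nll = -np.log(posteriors[np.arange(len(labels)), labels] + eps)
    # Labelling-difficulty proxy: predictive entropy. Ambiguous samples
    # would need more annotations per correction, so deprioritise them.
    entropy = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    return label_nll - entropy

# Toy usage: four samples, three classes, all currently labelled 0.
posteriors = np.array([[0.90, 0.05, 0.05],   # confident, agrees with label
                       [0.10, 0.80, 0.10],   # confident disagreement
                       [0.34, 0.33, 0.33],   # ambiguous
                       [0.05, 0.05, 0.90]])  # confident disagreement
labels = np.zeros(4, dtype=int)
print(np.argsort(-ranking_scores(posteriors, labels)))  # [3 1 2 0]
```

In the paper's simulation framework, an ordering like this would decide which samples are sent to annotators first.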
Related papers
- Extracting Clean and Balanced Subset for Noisy Long-tailed Classification [66.47809135771698]
We develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching.
By setting a manually-specified probability measure, we can reduce the side effects of noisy and long-tailed data simultaneously.
Our method extracts a class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise (sketched below).
arXiv Detail & Related papers (2024-04-10T07:34:37Z)
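A minimal sketch of the class-prototype idea from the entry above, assuming a pretrained feature extractor is available and every class appears among the noisy labels; the nearest-prototype assignment and per-class quota are stand-ins for the paper's distribution-matching procedure.

```python
import numpy as np

def balanced_clean_subset(feats, noisy_labels, n_classes, per_class):
    """Select a class-balanced subset whose noisy labels look clean.

    feats: (N, D) embeddings from a pretrained feature extractor
    (assumed available); every class is assumed to appear in
    noisy_labels so each prototype is well defined.
    """
    # Class prototypes: mean embedding of the samples carrying each label.
    protos = np.stack([feats[noisy_labels == c].mean(axis=0)
                       for c in range(n_classes)])
    # Pseudo-label = nearest prototype, a simple stand-in for the
    # paper's distribution-matching assignment.
    dists = np.linalg.norm(feats[:, None, :] - protos[None], axis=-1)
    pseudo = dists.argmin(axis=1)
    selected = []
    for c in range(n_classes):
        # Keep samples whose noisy label and pseudo-label agree,
        # nearest to the prototype first, capped for class balance.
        idx = np.where((noisy_labels == c) & (pseudo == c))[0]
        selected.extend(idx[np.argsort(dists[idx, c])][:per_class].tolist())
    return np.array(selected)
```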
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels to unlabeled data (sketched below).
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
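A rough sketch of the soft-curriculum idea from the entry above; the confidence-based weighting rule and the temperature parameter are assumptions, not the authors' exact formulation.

```python
import numpy as np

def soft_curriculum_weights(probs, labels=None, temperature=1.0):
    """Instance-wise weights (and labels) for a training batch.

    probs:  (N, C) class probabilities from the current classifier.
    labels: (N,) noisy integer labels, or None for unlabeled samples.
    """
    if labels is None:
        # Unlabeled data: adopt the classifier's prediction as a new label.
        labels = probs.argmax(axis=1)
    conf = probs[np.arange(len(labels)), labels]
    # Soft weight in [0, 1]: trust confident samples more. A lower
    # temperature sharpens the curriculum toward hard 0/1 selection.
    weights = conf ** (1.0 / temperature)
    return weights, labels
```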
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To reduce the need for labeled data, self-training is widely used in both academia and industry, generating pseudo labels on readily available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads (sketched below).
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
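The two-head decoupling described above might look roughly like this sketch; the backbone, head sizes, and loss combination are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoHeadSelfTraining(nn.Module):
    """Shared backbone with two independent heads: gen_head produces
    pseudo-labels and trains only on ground-truth labels, while
    use_head consumes the pseudo-labels."""

    def __init__(self, in_dim=32, feat_dim=128, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.gen_head = nn.Linear(feat_dim, n_classes)
        self.use_head = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        f = self.backbone(x)
        return self.gen_head(f), self.use_head(f)

def debiased_loss(model, x_lab, y_lab, x_unl):
    ce = nn.functional.cross_entropy
    gen_l, use_l = model(x_lab)
    gen_u, use_u = model(x_unl)
    # The generation head labels the unlabeled batch; detach() keeps
    # the pseudo-label target out of the gradient.
    pseudo = gen_u.argmax(dim=1).detach()
    # gen_head is supervised only by clean labels; use_head absorbs the
    # (possibly erroneous) pseudo-labels, so their bias does not
    # directly supervise the head that generates them.
    return ce(gen_l, y_lab) + ce(use_l, y_lab) + ce(use_u, pseudo)
```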
- Learning to Aggregate and Refine Noisy Labels for Visual Sentiment Analysis [69.48582264712854]
We propose a robust learning method for visual sentiment analysis with noisy labels.
Our method relies on an external memory to aggregate and filter noisy labels during training (sketched below).
We establish a benchmark for visual sentiment analysis with label noise using publicly available datasets.
arXiv Detail & Related papers (2021-09-15T18:18:28Z)
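A toy sketch of the external-memory idea from the entry above, using an exponential moving average of per-sample predictions as the aggregation rule; the momentum and threshold values are assumptions.

```python
import numpy as np

class PredictionMemory:
    """External memory of per-sample predictions, aggregated as an
    exponential moving average and used to filter/refine noisy labels."""

    def __init__(self, n_samples, n_classes, momentum=0.9):
        self.mem = np.full((n_samples, n_classes), 1.0 / n_classes)
        self.momentum = momentum

    def update(self, idx, probs):
        # Aggregate: EMA of the model's softmax outputs for this batch.
        self.mem[idx] = (self.momentum * self.mem[idx]
                         + (1.0 - self.momentum) * probs)

    def filter(self, idx, noisy_labels, threshold=0.5):
        # Keep a label if the memory still supports it; otherwise
        # propose the memory's argmax as a refined label.
        support = self.mem[idx, noisy_labels]
        keep = support >= threshold
        refined = self.mem[idx].argmax(axis=1)
        return keep, np.where(keep, noisy_labels, refined)
```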
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances (sketched below).
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements in robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
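The instance-dependent noise model above can be sketched as a classifier for the clean label plus a per-instance label-transition matrix; the parameterisation below is an assumption, not the paper's exact model.

```python
import torch
import torch.nn as nn

class InstanceDependentNoiseModel(nn.Module):
    """Sketch: p(noisy y | x) = sum_i p(clean i | x) * T(x)[i, y],
    where T(x) is an instance-dependent transition matrix."""

    def __init__(self, in_dim, n_classes, hidden=64):
        super().__init__()
        self.clean = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_classes))
        # One row-stochastic transition matrix per instance.
        self.trans = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_classes * n_classes))
        self.n_classes = n_classes

    def forward(self, x):
        p_clean = self.clean(x).softmax(dim=-1)                  # (B, C)
        T = self.trans(x).view(-1, self.n_classes, self.n_classes)
        T = T.softmax(dim=-1)                                    # rows sum to 1
        # Marginalise the latent clean label: p_noisy = p_clean @ T(x).
        p_noisy = torch.bmm(p_clean.unsqueeze(1), T).squeeze(1)  # (B, C)
        return p_noisy, p_clean

# Train by maximising log p(noisy label | x); use p_clean at test time.
```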
- Improving Generalization of Deep Fault Detection Models in the Presence of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers, together with a data augmentation technique (sketched below).
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
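A minimal sketch of the two-step framework from the entry above, using late-training per-sample loss as a proxy for "the update in the hypothesis space" and Gaussian jitter as the augmentation; both choices are assumptions.

```python
import numpy as np

def identify_outliers(loss_history, quantile=0.9):
    """Step 1: flag likely-mislabeled samples from training dynamics.

    loss_history: (epochs, N) per-sample losses recorded during training.
    Samples the model keeps failing to fit late in training are flagged.
    """
    loss_history = np.asarray(loss_history)
    late = loss_history[len(loss_history) // 2:].mean(axis=0)
    return late > np.quantile(late, quantile)

def modify_training_data(x, y, outlier_mask, jitter=0.01, rng=None):
    """Step 2: drop flagged samples, then augment the kept ones with
    small Gaussian jitter (one of several options one could compare)."""
    rng = rng or np.random.default_rng(0)
    x_kept, y_kept = x[~outlier_mask], y[~outlier_mask]
    x_aug = x_kept + jitter * rng.standard_normal(x_kept.shape)
    return np.concatenate([x_kept, x_aug]), np.concatenate([y_kept, y_kept])
```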
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.