Identifying Mislabeled Data using the Area Under the Margin Ranking
- URL: http://arxiv.org/abs/2001.10528v4
- Date: Wed, 23 Dec 2020 14:01:54 GMT
- Title: Identifying Mislabeled Data using the Area Under the Margin Ranking
- Authors: Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, Kilian Q. Weinberger
- Abstract summary: This paper introduces a new method to identify such samples and mitigate their impact when training neural networks.
A simple procedure - adding an extra class populated with purposefully mislabeled threshold samples - learns an AUM upper bound that isolates mislabeled data.
On the WebVision50 classification task our method removes 17% of training data, yielding a 1.6% (absolute) improvement in test error.
- Score: 35.57623165270438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Not all data in a typical training set help with generalization; some samples
can be overly ambiguous or outrightly mislabeled. This paper introduces a new
method to identify such samples and mitigate their impact when training neural
networks. At the heart of our algorithm is the Area Under the Margin (AUM)
statistic, which exploits differences in the training dynamics of clean and
mislabeled samples. A simple procedure - adding an extra class populated with
purposefully mislabeled threshold samples - learns an AUM upper bound that
isolates mislabeled data. This approach consistently improves upon prior work
on synthetic and real-world datasets. On the WebVision50 classification task
our method removes 17% of training data, yielding a 1.6% (absolute) improvement
in test error. On CIFAR100 removing 13% of the data leads to a 1.2% drop in
error.
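For intuition, the following is a minimal, self-contained sketch of the AUM statistic described above, written in PyTorch. The per-epoch margin is the assigned label's logit minus the largest other logit, and AUM averages this margin over training epochs. The random dummy logits, sample counts, and the percentile used for the threshold-sample cutoff are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal, self-contained sketch of the Area Under the Margin (AUM) statistic.
# Assumes per-epoch logits have already been recorded for every training sample;
# the random tensors below stand in for a real training run, and the percentile used
# for the threshold-sample cutoff is an illustrative choice, not the paper's exact recipe.
import torch

torch.manual_seed(0)
num_epochs, num_samples, num_classes = 20, 1000, 10

# logits[t, i] = model outputs for sample i at epoch t (dummy values here)
logits = torch.randn(num_epochs, num_samples, num_classes)
labels = torch.randint(num_classes, (num_samples,))

def margins(epoch_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-sample margin: assigned-label logit minus the largest other logit."""
    assigned = epoch_logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    others = epoch_logits.scatter(1, labels.unsqueeze(1), float("-inf"))
    return assigned - others.max(dim=1).values

# AUM = margin averaged over training epochs.
aum = torch.stack([margins(logits[t], labels) for t in range(num_epochs)]).mean(dim=0)

# Threshold samples: a subset deliberately relabeled into an extra (fake) class.
# Their AUM distribution calibrates the upper bound; training samples whose AUM
# falls below it are flagged as likely mislabeled.
threshold_idx = torch.arange(100)                  # illustrative subset of indices
cutoff = torch.quantile(aum[threshold_idx], 0.99)  # illustrative percentile choice
flagged = (aum < cutoff).nonzero(as_tuple=True)[0]
print(f"flagged {flagged.numel()} of {num_samples} samples as potentially mislabeled")
```

In a real run the margins would be accumulated from the logits produced at each training epoch rather than from stored tensors, and only the fake-class assignment of the threshold samples distinguishes them from ordinary training data.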
Related papers
- MyriadAL: Active Few Shot Learning for Histopathology [10.652626309100889]
We introduce an active few-shot learning framework, Myriad Active Learning (MAL).
MAL includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop.
Experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works.
arXiv Detail & Related papers (2023-10-24T20:08:15Z)
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL); a generic sketch of the mechanism is given after this list.
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
- Impact of Strategic Sampling and Supervision Policies on Semi-supervised Learning [23.4909421082857]
In semi-supervised representation learning frameworks, when the number of labelled data is very scarce, the quality and representativeness of these samples become increasingly important.
Existing literature on semi-supervised learning randomly samples a limited number of data points for labelling.
All these labelled samples are then used along with the unlabelled data throughout the training process.
arXiv Detail & Related papers (2022-11-27T18:29:54Z)
- UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning [89.56465237941013]
We propose UNICON, a simple yet effective sample selection method which is robust to high label noise.
We obtain an 11.4% improvement over the current state-of-the-art on the CIFAR100 dataset with a 90% noise rate.
arXiv Detail & Related papers (2022-03-28T07:36:36Z)
- An analysis of over-sampling labeled data in semi-supervised learning with FixMatch [66.34968300128631]
Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches.
This paper studies whether this common practice improves learning and how.
We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not.
arXiv Detail & Related papers (2022-01-03T12:22:26Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Unbiased Teacher for Semi-Supervised Object Detection [50.0087227400306]
We revisit Semi-Supervised Object Detection (SS-OD) and identify the pseudo-labeling bias issue in SS-OD.
We introduce Unbiased Teacher, a simple yet effective approach that jointly trains a student and a gradually progressing teacher in a mutually-beneficial manner.
arXiv Detail & Related papers (2021-02-18T17:02:57Z)
- Improving Generalization of Deep Fault Detection Models in the Presence of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class for each task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
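Several of the entries above (ReFixMatch, FixMatch, Unbiased Teacher) build on confidence-thresholded pseudo-labeling; as noted in the ReFixMatch entry, the sketch below shows that generic mechanism on dummy data. It is not a reimplementation of any listed method, and the 0.95 confidence threshold, logit scaling, and batch size are illustrative assumptions.

```python
# A generic sketch of confidence-thresholded pseudo-labeling on dummy data.
# The 0.95 softmax-confidence threshold and logit scale are illustrative choices.
import torch

torch.manual_seed(0)
num_unlabeled, num_classes = 256, 10

# Current model's predictions on a batch of unlabeled inputs (dummy logits here,
# scaled so that a portion of the predictions look confident).
unlabeled_logits = torch.randn(num_unlabeled, num_classes) * 4
probs = unlabeled_logits.softmax(dim=1)
confidence, pseudo_labels = probs.max(dim=1)

# Keep only predictions the model is confident about; the retained (input, pseudo-label)
# pairs are then treated as extra labeled examples in the next training step.
mask = confidence >= 0.95
kept_idx = mask.nonzero(as_tuple=True)[0]
kept_labels = pseudo_labels[kept_idx]
print(f"{kept_labels.numel()} / {num_unlabeled} unlabeled samples receive pseudo-labels")
```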