Assessing the Quality of the Datasets by Identifying Mislabeled Samples
- URL: http://arxiv.org/abs/2109.05000v1
- Date: Fri, 10 Sep 2021 17:14:09 GMT
- Title: Assessing the Quality of the Datasets by Identifying Mislabeled Samples
- Authors: Vaibhav Pulastya, Gaurav Nuti, Yash Kumar Atri, Tanmoy Chakraborty
- Abstract summary: We propose a novel statistic -- noise score -- as a measure for the quality of each data point to identify mislabeled samples.
In our work, we use the representations derived by the inference network of a data quality supervised variational autoencoder (AQUAVS).
We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets.
- Score: 14.881597737762316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the over-emphasis on the quantity of data, data quality has often
been overlooked. However, not all training data points contribute equally to
learning. In particular, a mislabeled sample can actively damage the
performance of the model and its ability to generalize out of distribution, as
the model might end up learning spurious artifacts present in the dataset. This
problem gets compounded by the prevalence of heavily parameterized and complex
deep neural networks, which can, with their high capacity, end up memorizing
the noise present in the dataset. This paper proposes a novel statistic --
noise score -- as a measure of the quality of each data point to identify such
mislabeled samples based on the variations in the latent space representation.
In our work, we use the representations derived by the inference network of
a data quality supervised variational autoencoder (AQUAVS). Our method leverages
the fact that samples belonging to the same class will have similar latent
representations. Therefore, by identifying the outliers in the latent space, we
can find the mislabeled samples. We validate our proposed statistic through
experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets in
different noise settings for the task of identifying mislabeled samples. We
further show significant improvements in accuracy for the classification task
for each dataset.
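The abstract does not spell out how the noise score is computed from the AQUAVS latent codes. The snippet below is only a minimal sketch of the latent-outlier idea it describes, scoring each sample by its per-class-normalized distance to the centroid of its assigned class; the specific formula is an assumption for illustration, not the paper's statistic.

```python
import numpy as np

def noise_scores(latents: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """latents: (N, D) codes from an encoder; labels: (N,) assigned class ids."""
    scores = np.zeros(len(labels))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = latents[idx].mean(axis=0)                  # class centroid in latent space
        dist = np.linalg.norm(latents[idx] - centroid, axis=1)
        scores[idx] = dist / (dist.std() + 1e-8)              # higher score => more outlying
    return scores

# Samples with the highest scores are flagged as candidate mislabeled points.
```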
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- On the Impact of Data Quality on Image Classification Fairness [11.329873246415797]
We measure key fairness metrics across a range of algorithms over multiple image classification datasets.
We treat label noise as inaccuracies in the labelling of the training set, and data noise as distortions in the data itself.
By adding noise to the original datasets, we explore the relationship between the quality of the training data and the fairness of the models trained on that data; a small noise-injection sketch follows this entry.
arXiv Detail & Related papers (2023-05-02T16:54:23Z)
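The entry above does not publish its exact corruption protocol; the following is a hedged sketch of symmetric label flipping, a common way to inject the kind of label noise such studies rely on. The flip rate and uniform-replacement scheme are illustrative assumptions.

```python
import numpy as np

def flip_labels(labels: np.ndarray, num_classes: int, rate: float, seed: int = 0) -> np.ndarray:
    """Return a copy of `labels` with a `rate` fraction reassigned to a different class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    corrupt = np.where(rng.random(len(labels)) < rate)[0]     # samples to corrupt
    for i in corrupt:
        candidates = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(candidates)                     # uniform over the other classes
    return noisy
```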
- Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features [43.41573458276422]
We introduce a novel learning-based solution, leveraging a noise detector instantiated as an LSTM network.
The proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise.
Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation; a minimal detector sketch follows this entry.
arXiv Detail & Related papers (2022-12-19T09:39:30Z)
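A minimal PyTorch sketch of the idea summarized above, assuming the detector consumes a per-sample sequence of training-dynamics features (here, per-epoch loss and the predicted probability of the assigned label) and outputs a mislabeled-vs-clean score; the feature set and layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NoiseDetector(nn.Module):
    """Reads a (batch, epochs, feat_dim) sequence of training dynamics per sample."""
    def __init__(self, feat_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, dynamics: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(dynamics)                       # final hidden state summarizes the trajectory
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)    # estimated P(mislabeled)

# Trained with binary cross-entropy on data whose labels were corrupted
# synthetically, so the ground-truth mislabeled/clean flags are known.
```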
- Combating noisy labels in object detection datasets [0.0]
We introduce the Confident Learning for Object Detection (CLOD) algorithm for assessing the quality of each label in object detection datasets.
We identify missing, spurious, mislabeled, and mislocated bounding boxes and suggest corrections.
The proposed method is able to point out nearly 80% of artificially disturbed bounding boxes with a false positive rate below 0.1.
arXiv Detail & Related papers (2022-11-25T10:05:06Z)
- S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
In the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N; a short neighborhood-consistency sketch follows this entry.
arXiv Detail & Related papers (2021-11-22T15:49:20Z)
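An illustrative scikit-learn sketch of the neighborhood label-consistency idea described above; the number of neighbors and the agreement threshold are assumptions chosen only for demonstration, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_clean(features: np.ndarray, labels: np.ndarray, k: int = 10, thresh: float = 0.5) -> np.ndarray:
    """Keep samples whose label agrees with at least `thresh` of their k nearest neighbors."""
    index = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = index.kneighbors(features)                       # idx[:, 0] is the sample itself
    neighbor_labels = labels[idx[:, 1:]]                      # labels of the k neighbors
    agreement = (neighbor_labels == labels[:, None]).mean(axis=1)
    return agreement >= thresh                                # boolean mask of (likely) clean samples
```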
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)
- Improving Generalization of Deep Fault Detection Models in the Presence of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
- On the Role of Dataset Quality and Heterogeneity in Model Confidence [27.657631193015252]
Safety-critical applications require machine learning models that output accurate and calibrated probabilities.
Uncalibrated deep networks are known to make over-confident predictions.
We study the impact of dataset quality by examining how dataset size and label noise affect model confidence; a brief calibration-error sketch follows this entry.
arXiv Detail & Related papers (2020-02-23T05:13:12Z)
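The entry above concerns calibration rather than a specific algorithm; as a point of reference, here is a short sketch of expected calibration error (ECE), the standard measure of the over-confidence it mentions. The 10-bin scheme is a common convention, not necessarily the paper's exact setup.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, bins: int = 10) -> float:
    """probs: (N, C) predicted class probabilities; labels: (N,) true class ids."""
    conf = probs.max(axis=1)                                  # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece                                                # gap between confidence and accuracy
```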
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)