Inconsistency Ranking-based Noisy Label Detection for High-quality Data
- URL: http://arxiv.org/abs/2212.00239v2
- Date: Thu, 15 Jun 2023 14:08:55 GMT
- Title: Inconsistency Ranking-based Noisy Label Detection for High-quality Data
- Authors: Ruibin Yuan, Hanzhi Yin, Yi Wang, Yifan He, Yushi Ye, Lei Zhang,
Zhizheng Wu
- Abstract summary: This paper proposes an automatic noisy label detection (NLD) technique with inconsistency ranking for high-quality data.
We investigate both inter-class and intra-class inconsistency ranking and compare several metric learning loss functions under different noise settings.
Experimental results confirm that the proposed solution improves both the efficiency and effectiveness of cleaning large-scale speaker recognition datasets.
- Score: 11.844624139434867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning requires high-quality annotated and massive
data. However, the size and the quality of a dataset are usually a trade-off in
practice, as data collection and cleaning are expensive and time-consuming. In
real-world applications, especially those using crowdsourcing datasets, it is
important to exclude noisy labels. To address this, this paper proposes an
automatic noisy label detection (NLD) technique with inconsistency ranking for
high-quality data. We apply this technique to the automatic speaker
verification (ASV) task as a proof of concept. We investigate both inter-class
and intra-class inconsistency ranking and compare several metric learning loss
functions under different noise settings. Experimental results confirm that the
proposed solution improves both the efficiency and effectiveness of cleaning
large-scale speaker recognition datasets.
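The abstract does not spell out how the inconsistency ranking is computed; a minimal sketch of one plausible reading, intra-class inconsistency as distance from each embedding to its labeled class centroid, is shown below. The function name and the centroid-based metric are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def intra_class_inconsistency(embeddings, labels):
    """Rank samples by intra-class inconsistency: one minus the cosine
    similarity between each embedding and its class centroid. Higher
    scores suggest likelier label noise. Illustrative sketch only; the
    paper's exact ranking metrics may differ."""
    # L2-normalize so dot products are cosine similarities
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = np.empty(len(labels), dtype=float)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        scores[idx] = 1.0 - embeddings[idx] @ centroid
    return scores  # sort descending to inspect the most suspicious labels first
```

Sorting these scores in descending order yields a cleaning priority list: the samples least similar to their own class are reviewed or discarded first.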
Related papers
- Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond [38.89457061559469]
We propose an innovative methodology that automates dataset creation with negligible cost and high efficiency.
We provide open-source software that incorporates existing methods for label error detection and robust learning under noisy and biased data.
We design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning.
arXiv Detail & Related papers (2024-08-21T04:45:12Z)
- Data Valuation with Gradient Similarity [1.997283751398032]
Data Valuation algorithms quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task.
We present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS).
Our approach has the ability to rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
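The gradient-similarity idea can be sketched in a few lines: score each training sample by how well its individual loss gradient aligns with the aggregate gradient on a trusted validation set. The logistic-regression setup and fixed parameter vector below are simplifying assumptions for illustration, not the authors' exact model or training loop.

```python
import numpy as np

def dvgs_scores(X_train, y_train, X_val, y_val, w):
    """Cosine similarity between each training sample's loss gradient and
    the mean validation gradient, at a fixed logistic-regression weight
    vector w. Low scores flag likely low-quality samples."""
    def per_sample_grads(X, y):
        # Gradient of the logistic loss for each sample, shape (n, d)
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return (p - y)[:, None] * X
    g_val = per_sample_grads(X_val, y_val).mean(axis=0)
    g_train = per_sample_grads(X_train, y_train)
    num = g_train @ g_val
    den = np.linalg.norm(g_train, axis=1) * np.linalg.norm(g_val) + 1e-12
    return num / den
```

A sample whose label is flipped produces a gradient pointing against the validation gradient, so its score drops toward -1 and it surfaces as a cleaning candidate.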
arXiv Detail & Related papers (2024-05-13T22:10:00Z)
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model, we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
- Learning to Detect Noisy Labels Using Model-Based Features [16.681748918518075]
We propose Selection-Enhanced Noisy label Training (SENT).
SENT does not rely on meta learning while having the flexibility of being data-driven.
It improves performance over strong baselines under the settings of self-training and label corruption.
arXiv Detail & Related papers (2022-12-28T10:12:13Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- Representation Learning for the Automatic Indexing of Sound Effects Libraries [79.68916470119743]
We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size.
Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
arXiv Detail & Related papers (2022-08-18T23:46:13Z)
- Noise-resistant Deep Metric Learning with Ranking-based Instance Selection [59.286567680389766]
We propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM).
PRISM identifies noisy data in a minibatch using average similarity against image features extracted from several previous versions of the neural network.
To alleviate the high computational cost brought by the memory bank, we introduce an acceleration method that replaces individual data points with the class centers.
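The class-center acceleration described above can be sketched as follows: flag the minibatch samples least similar to the stored center of their labeled class and exclude them from the loss. The single set of centers below stands in for the paper's average over features from several past network snapshots, and the fixed filter rate is an illustrative assumption.

```python
import numpy as np

def prism_filter(features, labels, class_centers, filter_rate=0.2):
    """Simplified PRISM-style selection: drop the fraction of the
    minibatch whose cosine similarity to its labeled class center is
    lowest, treating those samples as likely label noise. Returns a
    boolean keep-mask for the loss computation."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    # Similarity of each sample to the center of its own labeled class
    sims = np.einsum("ij,ij->i", f, c[labels])
    n_drop = int(len(labels) * filter_rate)
    keep = np.ones(len(labels), dtype=bool)
    if n_drop > 0:
        keep[np.argsort(sims)[:n_drop]] = False
    return keep
```

Replacing the full memory bank with one center per class turns an O(batch x memory) similarity computation into O(batch x classes), which is where the claimed speedup comes from.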
arXiv Detail & Related papers (2021-03-30T03:22:17Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- Deep Learning from Small Amount of Medical Data with Noisy Labels: A Meta-Learning Approach [0.0]
Computer vision systems require correctly labeled large datasets in order to be trained properly.
Medical imaging datasets are commonly tiny, which makes each sample very important in learning.
A label-noise-robust learning algorithm that makes use of the meta-learning paradigm is proposed in this article.
arXiv Detail & Related papers (2020-10-14T10:39:44Z)
- Audio Tagging by Cross Filtering Noisy Labels [26.14064793686316]
We present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging.
Our method achieves state-of-the-art performance and even surpasses the ensemble models.
arXiv Detail & Related papers (2020-07-16T07:55:04Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.