Handling Realistic Label Noise in BERT Text Classification
- URL: http://arxiv.org/abs/2305.16337v2
- Date: Fri, 20 Oct 2023 11:26:43 GMT
- Title: Handling Realistic Label Noise in BERT Text Classification
- Authors: Maha Tufail Agro, Hanan Aldarmaki
- Abstract summary: Real label noise is not random; rather, it is often correlated with input features or other annotator-specific factors.
We show that the presence of these types of noise significantly degrades BERT classification performance.
- Score: 1.0515439489916731
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Label noise refers to errors in training labels caused by cheap data
annotation methods, such as web scraping or crowd-sourcing, which can be
detrimental to the performance of supervised classifiers. Several methods have
been proposed to counteract the effect of random label noise in supervised
classification, and some studies have shown that BERT is already robust against
high rates of randomly injected label noise. However, real label noise is not
random; rather, it is often correlated with input features or other
annotator-specific factors. In this paper, we evaluate BERT in the presence of
two types of realistic label noise: feature-dependent label noise, and
synthetic label noise from annotator disagreements. We show that the presence
of these types of noise significantly degrades BERT classification performance.
To improve robustness, we evaluate different types of ensembles and
noise-cleaning methods and compare their effectiveness against label noise
across different datasets.
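The evaluation setup the abstract describes (inject label noise that correlates with the input features, fine-tune BERT on the noisy labels, and compare against a clean-label run) can be illustrated with a short script. The sketch below is a rough approximation under assumed choices, not the paper's exact protocol: it assumes Hugging Face `datasets`/`transformers` plus scikit-learn, uses `ag_news` and `bert-base-uncased` as illustrative dataset and model names, and approximates feature-dependent noise by flipping exactly those training labels that a weak bag-of-words classifier predicts incorrectly, so the corruptions correlate with the text features.

```python
import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative dataset/model; the paper's own benchmarks may differ.
dataset = load_dataset("ag_news")
train = dataset["train"].shuffle(seed=0).select(range(20000))  # small subset for speed
test = dataset["test"]
num_labels = train.features["label"].num_classes

# 1) Feature-dependent noise (assumed protocol): train a weak bag-of-words classifier
#    and replace the gold label with its prediction wherever the two disagree.
vec = TfidfVectorizer(max_features=20000)
X = vec.fit_transform(train["text"])
gold = np.array(train["label"])
weak_pred = LogisticRegression(max_iter=1000).fit(X, gold).predict(X)
noisy_labels = np.where(weak_pred != gold, weak_pred, gold)
print(f"injected noise rate: {(noisy_labels != gold).mean():.2%}")

# 2) Fine-tune BERT on the noisy labels.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

train_noisy = (train.map(encode, batched=True)
                    .remove_columns(["text", "label"])
                    .add_column("labels", noisy_labels.tolist()))
test_enc = test.map(encode, batched=True).rename_column("label", "labels")

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_labels)
args = TrainingArguments(output_dir="bert-noisy", num_train_epochs=2,
                         per_device_train_batch_size=32, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=train_noisy)
trainer.train()

# 3) Measure the damage on the clean test set; rerunning steps 2-3 with `gold`
#    instead of `noisy_labels` gives the clean-label baseline for comparison.
preds = trainer.predict(test_enc).predictions.argmax(-1)
print("test accuracy under feature-dependent noise:",
      (preds == np.array(test_enc["labels"])).mean())
```

The same scaffold can be reused for the ensemble and noise-cleaning comparisons the abstract mentions, for example by averaging the predictions of several fine-tuned runs or by filtering the noisy training set before fine-tuning.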
Related papers
- Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks [1.261491746208123]
This study aims to investigate the effects of label noise on gradient-boosted decision trees and methods to mitigate those effects.
The implemented methods demonstrate state-of-the-art noise detection performance on the Adult dataset and achieve the highest classification precision and recall on the Adult and Breast Cancer datasets.
arXiv Detail & Related papers (2024-09-13T09:09:24Z)
- Extracting Clean and Balanced Subset for Noisy Long-tailed Classification [66.47809135771698]
We develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching.
By setting a manually-specified probability measure, we can reduce the side effects of noisy and long-tailed data simultaneously.
Our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
arXiv Detail & Related papers (2024-04-10T07:34:37Z)
- Learning from Time Series under Temporal Label Noise [23.39598516168891]
We first propose and formalize temporal label noise, an unstudied problem for sequential classification of time series.
We show that our methods lead to state-of-the-art performance in the presence of diverse temporal label noise functions using real and synthetic data.
arXiv Detail & Related papers (2024-02-06T20:56:31Z)
- Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning [9.885278527023532]
We introduce Co-Prediction Prompt Tuning for noise correction in FET.
We integrate prediction results to recall labeled labels and utilize a differentiated margin to identify inaccurate labels.
Experimental results on three widely-used FET datasets demonstrate that our noise correction approach significantly enhances the quality of training samples.
arXiv Detail & Related papers (2023-10-23T06:04:07Z)
- Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research [0.0]
We introduce SYNLABEL, a framework designed to create noiseless datasets informed by real-world data.
We demonstrate its ability to precisely quantify label noise and its improvement over existing methodologies.
arXiv Detail & Related papers (2023-09-08T13:31:06Z)
- S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
At the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of labels in its feature-space neighborhood (a generic sketch of this idea appears after this list).
Our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z)
- Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting [45.58217741522973]
We show that label noise exists in adversarial training.
Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples.
We propose a method that automatically calibrates labels to address the label noise and robust overfitting.
arXiv Detail & Related papers (2021-10-07T01:15:06Z)
- Improving Medical Image Classification with Label Noise Using Dual-uncertainty Estimation [72.0276067144762]
We discuss and define the two common types of label noise in medical images.
We propose an uncertainty estimation-based framework to handle these two types of label noise in medical image classification.
arXiv Detail & Related papers (2021-02-28T14:56:45Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- A Second-Order Approach to Learning with Instance-Dependent Label Noise [58.555527517928596]
The presence of label noise often misleads the training of deep neural networks.
We show that the errors in human-annotated labels are more likely to be dependent on the difficulty levels of tasks.
arXiv Detail & Related papers (2020-12-22T06:36:58Z)
- Class2Simi: A Noise Reduction Perspective on Learning with Noisy Labels [98.13491369929798]
We propose a framework called Class2Simi, which transforms data points with noisy class labels to data pairs with noisy similarity labels.
Class2Simi is computationally efficient because the transformation is performed on-the-fly within mini-batches and it only changes the loss on top of the model prediction to a pairwise form (the basic transformation is sketched at the end of this page).
arXiv Detail & Related papers (2020-06-14T07:55:32Z)
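Several entries above describe noise cleaning by sample selection; the S3 entry, in particular, keeps a sample only when its annotated label is consistent with the labels found in its feature-space neighborhood, which is also the flavour of noise-cleaning method the main paper evaluates for BERT. Below is a minimal, generic sketch of that idea under assumed choices (TF-IDF features and a cosine k-NN from scikit-learn); it is an illustration, not the method of any particular paper listed here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def clean_by_neighborhood(texts, labels, k=10, agreement=0.5):
    """Keep an example only if its label matches at least `agreement` of the
    labels of its k nearest neighbours in TF-IDF feature space."""
    labels = np.asarray(labels)
    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = labels[idx[:, 1:]]              # drop the self-match
    agree = (neighbour_labels == labels[:, None]).mean(axis=1)
    return np.where(agree >= agreement)[0]

# Hypothetical usage: filter the noisy training set from the earlier sketch,
# then fine-tune BERT on the kept subset and compare against the unfiltered run.
# keep = clean_by_neighborhood(train["text"], noisy_labels)
# train_filtered = train.select(keep.tolist())
```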
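For the Class2Simi entry, the core transformation is concrete enough to sketch: within a mini-batch, noisy class labels become pairwise similarity labels, and the classifier's softmax outputs are combined into pairwise similarity probabilities on which a binary loss is computed. The PyTorch sketch below shows only this label transformation and a plain pairwise loss, not any additional noise-robust machinery from the paper, and all shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def class_to_simi(labels: torch.Tensor) -> torch.Tensor:
    """Pairwise similarity labels: 1 if two examples share a class label, else 0."""
    return (labels[:, None] == labels[None, :]).float()

def pairwise_similarity_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between predicted and (noisy) pairwise similarities."""
    probs = F.softmax(logits, dim=1)
    # Probability that two examples belong to the same (unknown) class.
    sim_pred = probs @ probs.T
    sim_true = class_to_simi(labels)
    # Exclude the diagonal (each example compared with itself).
    mask = ~torch.eye(len(labels), dtype=torch.bool)
    return F.binary_cross_entropy(sim_pred[mask], sim_true[mask])

# Example: a batch of 8 examples over 4 classes with (possibly noisy) class labels.
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
print(pairwise_similarity_loss(logits, labels))
```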