SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training
- URL: http://arxiv.org/abs/2210.04525v2
- Date: Tue, 11 Oct 2022 02:43:12 GMT
- Title: SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training
- Authors: Dan Qiao, Chenchen Dai, Yuyang Ding, Juntao Li, Qiang Chen, Wenliang Chen, Min Zhang
- Abstract summary: SelfMix is a simple yet effective method to handle label noise in text classification tasks.
Our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training.
- Score: 15.877178854064708
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The conventional success of textual classification relies on annotated data,
and the new paradigm of pre-trained language models (PLMs) still requires a few
labeled examples for downstream tasks. However, in real-world applications, label
noise inevitably exists in training data, damaging the effectiveness,
robustness, and generalization of the models constructed on such data.
Recently, remarkable achievements have been made to mitigate this dilemma in
visual data, while only a few works explore textual data. To fill this gap, we
present SelfMix, a simple yet effective method, to handle label noise in text
classification tasks. SelfMix uses the Gaussian Mixture Model to separate
samples and leverages semi-supervised learning. Unlike previous works requiring
multiple models, our method utilizes the dropout mechanism on a single model to
reduce the confirmation bias in self-training and introduces a textual-level
mixup training strategy. Experimental results on three text classification
benchmarks with different types of text show that our proposed method
outperforms strong baselines designed for both textual and visual data under
different noise ratios and noise types. Our code is available at
https://github.com/noise-learning/SelfMix.
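The abstract names three mechanisms: GMM-based separation of (probably) clean and noisy samples, dropout-based pseudo-labelling on a single model, and a textual-level mixup strategy. Below is a minimal PyTorch-style sketch of those pieces under illustrative assumptions (a generic `model` that maps tokenized inputs to class logits, per-sample losses already collected as a NumPy array, arbitrary threshold and Beta parameters); it is not the authors' released implementation, which is in the repository linked above.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

# 1) GMM-based sample separation: fit a two-component GMM on per-sample
#    training losses; the low-mean component is treated as the "clean" set.
def split_clean_noisy(losses: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    losses = losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(losses)
    clean_component = gmm.means_.argmin()
    p_clean = gmm.predict_proba(losses)[:, clean_component]
    return p_clean > threshold          # boolean mask over the training set

# 2) Pseudo-labels from a single model: two forward passes with dropout active
#    stand in for the two networks used by co-training methods; their average
#    is the soft pseudo-label for a sample judged to be noisy.
#    `model` is assumed to return class logits for the given inputs.
def dropout_pseudo_label(model, inputs) -> torch.Tensor:
    model.train()                        # keep dropout enabled
    with torch.no_grad():
        p1 = F.softmax(model(**inputs), dim=-1)
        p2 = F.softmax(model(**inputs), dim=-1)
    return (p1 + p2) / 2

# 3) Textual-level mixup: interpolate sentence representations and (soft)
#    labels of two samples with a Beta-distributed coefficient.
def mixup(h_a, h_b, y_a, y_b, alpha: float = 0.75):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)            # keep the mixed sample closer to h_a
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b
```

The actual training loop, loss weighting, and hyper-parameter settings are in the linked repository.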
Related papers
- Vision-Language Models are Strong Noisy Label Detectors [76.07846780815794]
This paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models.
DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels.
Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
arXiv Detail & Related papers (2024-09-29T12:55:17Z)
- Pre-Trained Vision-Language Models as Partial Annotators [40.89255396643592]
Pre-trained vision-language models learn from massive data to model unified representations of images and natural languages.
In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for pre-trained model application and experiment on image classification tasks.
arXiv Detail & Related papers (2024-05-23T17:17:27Z)
- Elevating Code-mixed Text Handling through Auditory Information of Words [24.53638976212391]
We propose an effective approach for creating language models that handle code-mixed textual data using auditory information of words from SOUNDEX (a simplified Soundex encoder is sketched after this list).
Our approach includes a masked-language-modelling pre-training step that incorporates SOUNDEX representations (SAMLM) and a new method of providing input data to the pre-trained model.
arXiv Detail & Related papers (2023-10-27T14:03:30Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically (see the similarity-based filtering sketch after this list).
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up (see the interpolation sketch after this list).
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Learning to Detect Noisy Labels Using Model-Based Features [16.681748918518075]
We propose Selection-Enhanced Noisy label Training (SENT).
SENT does not rely on meta learning while having the flexibility of being data-driven.
It improves performance over strong baselines under the settings of self-training and label corruption.
arXiv Detail & Related papers (2022-12-28T10:12:13Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z)
- DivideMix: Learning with Noisy Labels as Semi-supervised Learning [111.03364864022261]
We propose DivideMix, a framework for learning with noisy labels.
Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2020-02-18T06:20:06Z)
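Two entries above (DeFT and the CLIP surrogate-model paper) share one idea: compare each image against text prompts for the candidate classes and distrust labels that disagree with the vision-language model's zero-shot prediction. A minimal sketch of that idea with Hugging Face's CLIPModel follows; the class names, prompt template, and simple argmax disagreement rule are illustrative assumptions, not either paper's exact procedure.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "bird"]                      # illustrative label set
prompts = [f"a photo of a {name}" for name in class_names]

def label_is_suspect(image: Image.Image, given_label: int) -> bool:
    """Flag a sample whose given label disagrees with CLIP's zero-shot prediction."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return int(probs.argmax()) != given_label
```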
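The SAMLM entry feeds SOUNDEX codes of words into masked-language-model pre-training so that phonetically similar spellings in code-mixed text share a representation. A simplified, dependency-free Soundex encoder is sketched below; it omits the H/W adjacency rule of the full algorithm and assumes alphabetic, non-empty input.

```python
def soundex(word: str) -> str:
    """Simplified Soundex: keep the first letter, encode later consonants as digits,
    drop vowels, collapse adjacent duplicates, and pad/truncate to four characters."""
    digits = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
              **dict.fromkeys("dt", "3"), "l": "4",
              **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    code, prev = word[0].upper(), digits.get(word[0], "")
    for ch in word[1:]:
        d = digits.get(ch, "")
        if d and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both map to R163
```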
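The SE-mixup entry generates soft labels by linearly interpolating the model's predicted distribution with the one-hot label of the same sample before mixing. A toy version of that interpolation is below; the fixed `weight` is an illustrative stand-in for the paper's instance-specific coefficient.

```python
import torch
import torch.nn.functional as F

def soft_label(logits: torch.Tensor, label: int, num_classes: int,
               weight: float = 0.1) -> torch.Tensor:
    """Interpolate the one-hot label with the model's predicted distribution."""
    one_hot = F.one_hot(torch.tensor(label), num_classes).float()
    pred = F.softmax(logits, dim=-1)
    return (1.0 - weight) * one_hot + weight * pred

# e.g. soft_label(torch.tensor([2.0, 0.5, -1.0]), label=0, num_classes=3)
```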