DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification
- URL: http://arxiv.org/abs/2209.05297v1
- Date: Mon, 12 Sep 2022 15:01:04 GMT
- Title: DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification
- Authors: Hui Chen, Wei Han, Diyi Yang, Soujanya Poria
- Abstract summary: This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
- Score: 56.817386699291305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a simple yet effective interpolation-based data
augmentation approach termed DoubleMix, to improve the robustness of models in
text classification. DoubleMix first leverages a couple of simple augmentation
operations to generate several perturbed samples for each training sample, and
then uses the perturbed and original data to carry out a two-step
interpolation in the hidden space of neural models. Concretely, it first mixes
the perturbed data into a synthetic sample and then mixes the original data
with this synthetic sample. DoubleMix enhances models' robustness by
learning the "shifted" features in hidden space. On six text classification
benchmark datasets, our approach outperforms several popular text augmentation
methods including token-level, sentence-level, and hidden-level data
augmentation techniques. Also, experiments in low-resource settings show our
approach consistently improves models' performance when the training data is
scarce. Extensive ablation studies and case studies confirm that each component
of our approach contributes to the final performance and show that our approach
exhibits superior performance on challenging counterexamples. Additionally,
visual analysis shows that text features generated by our approach are highly
interpretable. Our code for this paper can be found at
https://github.com/declare-lab/DoubleMix.git.
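As a rough illustration of the two-step interpolation described in the abstract, here is a minimal PyTorch-style sketch. The function name two_step_mixup, the Dirichlet/Beta weight sampling, and the constraint keeping the original sample dominant are illustrative assumptions, not necessarily the paper's exact scheme.

    import torch
    import numpy as np

    def two_step_mixup(h_orig, h_perturbed, alpha=1.0):
        """Sketch of DoubleMix-style two-step interpolation in hidden space.

        h_orig:      (batch, dim) hidden states of the original samples
        h_perturbed: list of (batch, dim) hidden states, one per perturbed copy
        """
        k = len(h_perturbed)
        # Step 1: mix the perturbed copies into a single synthetic sample.
        w = np.random.dirichlet([alpha] * k)  # assumed weight distribution
        h_syn = sum(float(wi) * hi for wi, hi in zip(w, h_perturbed))
        # Step 2: mix the original with the synthetic sample. Keeping the
        # weight on the original >= 0.5 (an assumption) lets training reuse
        # the original sample's label for the mixed hidden state.
        lam = float(np.random.beta(alpha, alpha))
        lam = max(lam, 1.0 - lam)
        return lam * h_orig + (1.0 - lam) * h_syn

In training, h_orig and h_perturbed would be taken from the same intermediate encoder layer, and the mixed hidden state would be passed through the remaining layers before computing the loss against the original label.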
Related papers
- TransformMix: Learning Transformation and Mixing Strategies from Data [20.79680733590554]
We propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data.
We demonstrate the effectiveness of TransformMix on multiple datasets in transfer learning, classification, object detection, and knowledge distillation settings.
arXiv Detail & Related papers (2024-03-19T04:36:41Z)
- Manifold DivideMix: A Semi-Supervised Contrastive Learning Framework for Severe Label Noise [4.90148689564172]
Real-world datasets contain noisy label samples that have no semantic relevance to any class in the dataset.
Most state-of-the-art methods leverage ID labeled noisy samples as unlabeled data for semi-supervised learning.
We propose incorporating the information from all the training data by leveraging the benefits of self-supervised training.
arXiv Detail & Related papers (2023-08-13T23:33:33Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up (see the sketch after this list).
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Feature Weaken: Vicinal Data Augmentation for Classification [1.7013938542585925]
We use Feature Weaken to construct the vicinal data distribution with the same cosine similarity for model training.
This work can not only improve the classification performance and generalization of the model, but also stabilize the model training and accelerate the model convergence.
arXiv Detail & Related papers (2022-11-20T11:00:23Z)
- Dataset Distillation via Factorization [58.8114016318593]
We introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline.
HaBa explores decomposing a dataset into two components: data Hallucination networks and Bases.
Our method can yield significant improvements on downstream classification tasks compared with previous state-of-the-art methods, while reducing the total number of compressed parameters by up to 65%.
arXiv Detail & Related papers (2022-10-30T08:36:19Z)
- SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training [15.877178854064708]
SelfMix is a simple yet effective method to handle label noise in text classification tasks.
Our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training.
arXiv Detail & Related papers (2022-10-10T09:46:40Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
- Sequence-Level Mixed Sample Data Augmentation [119.94667752029143]
This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems.
Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set.
arXiv Detail & Related papers (2020-11-18T02:18:04Z)
- Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks (the underlying interpolation is written out after this list).
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
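Ordinary mixup, as used in the Mixup-Transformer and SeqMix entries, forms x̃ = λ·x_i + (1 − λ)·x_j and ỹ = λ·y_i + (1 − λ)·y_j for a random λ. The instance-specific label smoothing in the Self-Evolution entry replaces the one-hot ỹ with a soft label interpolated from the model's own output. A minimal PyTorch-style sketch of that interpolation; the name smooth_labels and the weight eps are illustrative assumptions, not the paper's exact formulation:

    import torch
    import torch.nn.functional as F

    def smooth_labels(logits, targets, eps=0.1):
        """Sketch of instance-specific label smoothing: interpolate the
        model's output distribution with the one-hot labels to obtain the
        soft labels that are later mixed up across instances.
        """
        num_classes = logits.size(-1)
        one_hot = F.one_hot(targets, num_classes).float()
        probs = logits.softmax(dim=-1).detach()  # no gradient through targets
        return (1.0 - eps) * one_hot + eps * probs

The resulting soft labels would then be mixed across instances exactly like ordinary mixup targets.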
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.