EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance
Text Classification
- URL: http://arxiv.org/abs/2204.11205v1
- Date: Sun, 24 Apr 2022 06:53:48 GMT
- Title: EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance
Text Classification
- Authors: Minyi Zhao, Lu Zhang, Yi Xu, Jiandong Ding, Jihong Guan, Shuigeng Zhou
- Abstract summary: We present an easy and plug-in data augmentation framework EPiDA to support effective text classification.
EPiDA employs two mechanisms, relative entropy maximization (REM) and conditional entropy minimization (CEM), to control data generation.
EPiDA can support efficient and continuous data generation for effective classifier training.
- Score: 34.15923302216751
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent works have empirically shown the effectiveness of data augmentation
(DA) in NLP tasks, especially for those suffering from data scarcity.
Intuitively, given the size of generated data, their diversity and quality are
crucial to the performance of targeted tasks. However, to the best of our
knowledge, most existing methods consider only either the diversity or the
quality of augmented data, and thus cannot fully exploit the potential of DA for NLP.
In this paper, we present an easy and plug-in data augmentation framework EPiDA
to support effective text classification. EPiDA employs two mechanisms:
relative entropy maximization (REM) and conditional entropy minimization (CEM)
to control data generation, where REM is designed to enhance the diversity of
augmented data while CEM is exploited to ensure their semantic consistency.
EPiDA can support efficient and continuous data generation for effective
classifier training. Extensive experiments show that EPiDA outperforms existing
SOTA methods in most cases, even without using any agent networks or pre-trained
generation networks, and it works well with various DA algorithms and
classification models. Code is available at
https://github.com/zhaominyiz/EPiDA.
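The abstract describes REM and CEM only at a high level. As a rough illustration of how a diversity reward and a consistency penalty could jointly rank candidate augmentations, here is a minimal numpy sketch; the function names, the linear combination, and the weight `lam` are illustrative assumptions, not the paper's exact objectives:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Relative entropy KL(p || q) between two probability vectors.
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def entropy(p, eps=1e-12):
    # Shannon entropy of a probability vector.
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def score_candidates(probs_orig, probs_augs, lam=0.5):
    # Reward divergence from the original prediction (diversity, REM-like)
    # and penalize uncertain predictions (consistency, CEM-like).
    # probs_orig: (C,) classifier probabilities for the original sample.
    # probs_augs: (N, C) classifier probabilities for N augmented candidates.
    return np.array([kl_divergence(probs_orig, q) - lam * entropy(q)
                     for q in probs_augs])

# Toy usage: rank four candidate augmentations and keep the top two.
p_orig = np.array([0.85, 0.10, 0.05])
p_augs = np.array([
    [0.80, 0.15, 0.05],  # close to the original, confident
    [0.34, 0.33, 0.33],  # diverse but very uncertain
    [0.60, 0.35, 0.05],  # moderately diverse, fairly confident
    [0.10, 0.85, 0.05],  # diverse and confident (possible label flip)
])
scores = score_candidates(p_orig, p_augs)
print("selected:", np.argsort(scores)[-2:][::-1])
```

In a plug-in setup of this kind, the retained candidates could then be fed directly into the training batch of any downstream classifier.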
Related papers
- Generalized Group Data Attribution [28.056149996461286]
Data Attribution methods quantify the influence of individual training data points on model outputs.
Existing data attribution methods are often computationally intensive, limiting their applicability to large-scale machine learning models.
We introduce the Generalized Group Data Attribution (GGDA) framework, which reduces this cost by attributing influence to groups of training points instead of individual ones.
arXiv Detail & Related papers (2024-10-13T17:51:21Z)
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing data pruning methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- ADLDA: A Method to Reduce the Harm of Data Distribution Shift in Data Augmentation [11.887799310374174]
This study introduces a novel data augmentation technique, ADLDA, aimed at mitigating the negative impact of data distribution shifts.
Experimental results demonstrate that ADLDA significantly enhances model performance across multiple datasets.
arXiv Detail & Related papers (2024-05-11T03:20:35Z)
- DreamDA: Generative Data Augmentation with Diffusion Models [68.22440150419003]
This paper proposes DreamDA, a new classification-oriented data augmentation framework.
DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds.
In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels.
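The pseudo-labeling step is only summarized here; as a generic illustration, the numpy sketch below shows the confidence-thresholded pseudo-labeling pattern that self-training paradigms typically build on (the `pseudo_label` helper and the 0.9 threshold are illustrative assumptions, not DreamDA's actual procedure):

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    # Keep only generated samples whose classifier prediction is confident
    # enough, and use the predicted class as the pseudo label.
    # probs: (N, C) class probabilities for N generated samples.
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, labels[keep]

# Toy usage: two of three generated samples pass the confidence filter.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.50, 0.30, 0.20],
                  [0.05, 0.92, 0.03]])
keep, labels = pseudo_label(probs)
print(keep, labels)  # -> [0 2] [0 1]
```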
arXiv Detail & Related papers (2024-03-19T15:04:35Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning [57.83232242068982]
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms.
It remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL.
This work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy.
arXiv Detail & Related papers (2023-05-25T15:46:20Z)
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
arXiv Detail & Related papers (2022-05-31T10:35:55Z)
- Generalization in Reinforcement Learning by Soft Data Augmentation [11.752595047069505]
SOft Data Augmentation (SODA) is a method that decouples augmentation from policy learning.
We find SODA to significantly advance sample efficiency, generalization, and stability in training over state-of-the-art vision-based RL methods.
arXiv Detail & Related papers (2020-11-26T17:00:34Z)
- CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
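The contrastive regularizer itself is not spelled out in this summary; for orientation, here is a minimal numpy sketch of the InfoNCE-style loss that contrastive objectives of this kind are commonly built on (the temperature value and the choice of negatives are illustrative assumptions, not CoDA's exact objective):

```python
import numpy as np

def info_nce_loss(z_anchor, z_positive, z_negatives, temperature=0.1):
    # -log softmax probability of the positive pair: pulls the anchor
    # toward its augmented view and pushes it away from the negatives.
    # All embeddings are assumed L2-normalized; z_negatives is (N, D).
    pos = z_positive @ z_anchor / temperature
    negs = z_negatives @ z_anchor / temperature
    return float(np.log(np.exp(pos) + np.sum(np.exp(negs))) - pos)

# Toy usage: sample 0's perturbed view is the positive, the rest negatives.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize the batch
z_pos = z[0] + 0.05 * rng.normal(size=16)
z_pos /= np.linalg.norm(z_pos)
print(round(info_nce_loss(z[0], z_pos, z[1:]), 3))
```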
arXiv Detail & Related papers (2020-10-16T23:57:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.