Data Augmentation is Dead, Long Live Data Augmentation
- URL: http://arxiv.org/abs/2402.14895v1
- Date: Thu, 22 Feb 2024 16:42:37 GMT
- Title: Data Augmentation is Dead, Long Live Data Augmentation
- Authors: Frédéric Piedboeuf and Philippe Langlais
- Abstract summary: We show that classical data augmentation is simply a way of performing better fine-tuning.
We furthermore show that zero- and few-shot data generation via conversational agents such as ChatGPT or Llama 2 can increase performance.
- Score: 9.84721698045897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textual data augmentation (DA) is a prolific field of study in which novel
techniques for creating artificial data are regularly proposed, and it has
demonstrated great effectiveness in small-data settings, at least for text
classification tasks. In this paper, we challenge those results, showing that
classical data augmentation is simply a way of performing better fine-tuning,
and that spending more time fine-tuning before applying data augmentation
negates its effect. This is a significant contribution, as it answers several
questions that were left open in recent years, namely: which DA technique
performs best (all of them, as long as they generate data close enough to the
training set so as not to impair training) and why DA showed positive results
(it facilitates the training of the network). We furthermore show that zero-
and few-shot data generation via conversational agents such as ChatGPT or
Llama 2 can increase performance, concluding that this form of data
augmentation does still work, even if classical methods do not.
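As an illustration of the zero/few-shot generation the abstract refers to, here is a minimal sketch built on the OpenAI Python client; the model name, prompt wording, and label are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: few-shot generation of synthetic labelled examples
# with a conversational agent. The model name, prompt, and label are
# illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_examples(label: str, seed_examples: list[str], n: int = 5) -> list[str]:
    """Ask the model for n new sentences of the given class,
    conditioned on a few real training examples (few-shot)."""
    shots = "\n".join(f"- {s}" for s in seed_examples)
    prompt = (
        f"Here are examples of '{label}' sentences:\n{shots}\n"
        f"Write {n} new, diverse sentences of the same class, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [l.lstrip("- ").strip() for l in lines if l.strip()][:n]

synthetic = generate_examples("positive", ["Great movie!", "Loved every minute."])
```

Zero-shot generation is the same call without the seed examples; the paper's point is that this LLM-based generation still helps even when classical DA does not.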
Related papers
- On-the-fly Data Augmentation for Forecasting with Deep Learning [0.35998666903987897]
We propose OnDAT (On-the-fly Data Augmentation for Time series) to address this issue.
By generating a new augmented dataset at each iteration, the model is exposed to constantly changing variations of the augmented data.
We validated the proposed approach using a state-of-the-art deep learning forecasting method and 8 benchmark datasets containing a total of 75,797 time series.
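A minimal sketch of the on-the-fly idea, with Gaussian jitter standing in for OnDAT's actual bootstrap-based augmentation and a `train_step` callable assumed as the training hook:

```python
# Sketch of on-the-fly augmentation: a fresh augmented variant of each
# batch is generated at every iteration rather than fixing one augmented
# dataset up front. Gaussian jitter stands in for OnDAT's actual
# bootstrap-based augmentation; `train_step` is an assumed training hook.
import numpy as np

def jitter(batch: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Illustrative augmentation: add small Gaussian noise to each series."""
    return batch + np.random.normal(0.0, sigma, size=batch.shape)

def train_on_the_fly(train_step, series: np.ndarray, epochs: int, batch_size: int):
    for _ in range(epochs):
        for start in range(0, len(series), batch_size):
            batch = series[start:start + batch_size]
            augmented = jitter(batch)  # regenerated at every iteration
            train_step(np.concatenate([batch, augmented], axis=0))
```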
arXiv Detail & Related papers (2024-04-25T17:16:13Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the reasoning skills necessary for the intended downstream application.
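A toy sketch of the gradient-similarity selection step, assuming per-example gradient features have already been computed and projected to a low dimension (LESS itself projects LoRA gradients); this illustrates the search, not the authors' implementation:

```python
# Toy gradient-similarity data selection: score each training example by
# the cosine similarity between its (projected) gradient and the target
# task's gradient, then keep the top 5%. Gradient features are assumed
# to be precomputed.
import numpy as np

def select_top_fraction(train_grads: np.ndarray, target_grad: np.ndarray,
                        fraction: float = 0.05) -> np.ndarray:
    """Return indices of the highest-scoring training examples."""
    g = train_grads / np.linalg.norm(train_grads, axis=1, keepdims=True)
    t = target_grad / np.linalg.norm(target_grad)
    scores = g @ t                      # cosine similarity per example
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[-k:]      # top-k indices
```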
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
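One plausible way to realize such a multi-task setup, sketched with hypothetical task tags and auxiliary tasks (not the paper's exact formulation):

```python
# Sketch: casting GEC plus alignment-derived auxiliary tasks as a single
# mixed seq2seq training set. The task tags and the choice of auxiliary
# tasks are illustrative assumptions.
def build_multitask_pairs(source: str, corrected: str) -> list[tuple[str, str]]:
    return [
        ("<gec> " + source, corrected),        # main correction task
        ("<copy> " + corrected, corrected),    # auxiliary: identity on clean text
        ("<detect> " + source,                 # auxiliary: is the input already correct?
         "correct" if source == corrected else "incorrect"),
    ]

pairs = build_multitask_pairs("She go to school .", "She goes to school .")
```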
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Domain Generalization by Rejecting Extreme Augmentations [13.114457707388283]
We show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance.
We propose a simple training procedure: (i) use uniform sampling over standard data augmentation transformations; (ii) increase the strength of the transformations to account for the higher data variance expected when working out-of-domain; and (iii) devise a new reward function to reject extreme transformations that can harm the training.
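A rough sketch of this procedure, where the reward is taken to be the negative training loss (an assumption) and `transforms` is any list of strength-parameterized augmentations:

```python
# Sketch of reject-extreme-augmentations: sample a transform uniformly,
# apply it at increased strength, and fall back to the clean image when
# the reward (here: negative loss, an assumption) is too low.
import random

def augment_with_rejection(image, label, transforms, model, loss_fn,
                           strength: float = 1.5, min_reward: float = -2.0):
    t = random.choice(transforms)                 # (i) uniform sampling
    candidate = t(image, strength)                # (ii) increased strength
    reward = -loss_fn(model(candidate), label)    # (iii) reward for this sample
    return candidate if reward >= min_reward else image  # reject extremes
```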
arXiv Detail & Related papers (2023-10-10T14:46:22Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
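A minimal sketch of such rephrasing-based augmentation with the OpenAI Python client; the model name and prompt are illustrative, not those of AugGPT:

```python
# Sketch of AugGPT-style augmentation: each training sentence is rewritten
# into several label-preserving variants. Model name and prompt wording
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def rephrase(sentence: str, n: int = 4) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rephrase the following sentence in {n} different ways, "
                              f"one per line, preserving its meaning:\n{sentence}"}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [l.strip() for l in lines if l.strip()][:n]
```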
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples on which trained models yield performance comparable with models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its applications.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
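A sketch in PyTorch of what an augmentation-aware discriminator can look like, with a single rotation angle as the (illustrative) augmentation parameter and an arbitrary small backbone:

```python
# Sketch of an augmentation-aware discriminator: alongside the usual
# real/fake logit, an auxiliary head predicts the parameter of the
# augmentation applied to the input. A single scalar parameter and this
# tiny backbone are illustrative choices, not the paper's architecture.
import torch.nn as nn

class AugAwareDiscriminator(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.real_fake = nn.Linear(feat_dim, 1)   # standard GAN head
        self.aug_param = nn.Linear(feat_dim, 1)   # predicts augmentation parameter

    def forward(self, x):
        h = self.backbone(x)
        return self.real_fake(h), self.aug_param(h)

# Training combines the usual adversarial loss on the real/fake head with
# a regression loss (e.g. MSE) between aug_param's output and the true
# parameter of the augmentation that was applied.
```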
arXiv Detail & Related papers (2022-05-31T10:35:55Z)
- FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning [27.871007011425775]
We propose FlipDA, a novel data augmentation method that jointly uses a generative model and a classifier to generate label-flipped data.
Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness: it substantially improves many tasks while not negatively affecting the others.
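A sketch of the label-flip filtering step using Hugging Face pipelines as stand-ins (FlipDA itself uses T5-based infilling); the model names are illustrative:

```python
# Sketch of FlipDA-style filtering: a masked language model proposes
# edits of a sentence, and only candidates that the task classifier
# assigns the *flipped* label are kept as augmented data. Model names
# are illustrative stand-ins, not FlipDA's T5-based setup.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def flip_augment(masked_sentence: str, flipped_label: str, top_k: int = 10) -> list[str]:
    candidates = [c["sequence"] for c in fill(masked_sentence, top_k=top_k)]
    return [s for s in candidates if clf(s)[0]["label"] == flipped_label]

# e.g. flip_augment("The movie was <mask>.", "NEGATIVE")
```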
arXiv Detail & Related papers (2021-08-13T17:51:31Z)
- Data Weighted Training Strategies for Grammatical Error Correction [8.370770440898454]
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
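A minimal sketch of delta-log-perplexity scoring and one simple way to turn scores into training weights; the log-likelihood callables are assumed to be provided, and the sigmoid weighting is an illustrative choice rather than the paper's schedule:

```python
# Sketch of delta-log-perplexity scoring: an example's score is the gap
# between its log-likelihood under a model fine-tuned on trusted data and
# under the base model. `base_ll` and `finetuned_ll` are assumed callables.
import math

def delta_log_perplexity(example: str, base_ll, finetuned_ll) -> float:
    return finetuned_ll(example) - base_ll(example)

def example_weight(score: float, temperature: float = 1.0) -> float:
    """Squash the score into (0, 1); one simple weighting choice."""
    return 1.0 / (1.0 + math.exp(-score / temperature))
```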
arXiv Detail & Related papers (2020-08-07T03:30:14Z)
- Complex Wavelet SSIM based Image Data Augmentation [0.0]
We look at the MNIST handwritten digit dataset, an image dataset used for digit recognition.
We take a detailed look at one of the most popular augmentation techniques used for this dataset: elastic deformation.
We propose to use a similarity measure called Complex Wavelet Structural Similarity Index Measure (CWSSIM) to selectively filter out the irrelevant data.
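A sketch of the filtering idea, with plain SSIM from scikit-image standing in for CWSSIM (an explicit substitution) together with the classic elastic-deformation recipe:

```python
# Sketch of similarity-based filtering of elastic deformations. Plain
# SSIM from scikit-image stands in for the complex-wavelet variant
# (CWSSIM) used by the paper; deformations too dissimilar from the
# original digit are discarded.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates
from skimage.metrics import structural_similarity

def elastic_deform(img: np.ndarray, alpha: float = 36.0, sigma: float = 5.0) -> np.ndarray:
    """Classic elastic deformation: warp by a smooth random displacement field."""
    dx = gaussian_filter(np.random.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(np.random.uniform(-1, 1, img.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]), indexing="ij")
    return map_coordinates(img, [y + dy, x + dx], order=1)

def keep_if_similar(original: np.ndarray, deformed: np.ndarray,
                    threshold: float = 0.5) -> bool:
    score = structural_similarity(original, deformed,
                                  data_range=float(original.max() - original.min()))
    return score >= threshold  # drop augmentations that drift too far
```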
arXiv Detail & Related papers (2020-07-11T21:11:46Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the resulting dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)