On Evaluation Protocols for Data Augmentation in a Limited Data Scenario
- URL: http://arxiv.org/abs/2402.14895v2
- Date: Mon, 16 Sep 2024 20:11:19 GMT
- Title: On Evaluation Protocols for Data Augmentation in a Limited Data Scenario
- Authors: Frédéric Piedboeuf, Philippe Langlais
- Abstract summary: We show that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning.
We further show that zero- and few-shot DA via conversational agents such as ChatGPT or LLama2 can increase performance.
- Score: 11.09784120582206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and it has demonstrated great efficiency in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates the training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or LLama2 can increase performance, confirming that this form of data augmentation is preferable to classical methods.
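To make the zero-/few-shot DA setting concrete, here is a minimal sketch of prompting a conversational agent for label-preserving paraphrases. The OpenAI client is real, but the model name, prompt wording, and helper function are illustrative assumptions rather than the paper's exact setup; the same pattern underlies the ChatGPT-based AugGPT approach listed below.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def paraphrase(text: str, label: str, k: int = 3) -> list[str]:
    """Ask a chat model for k label-preserving paraphrases of one example."""
    prompt = (
        f"Here is a '{label}' example from a text classification dataset:\n"
        f"{text}\n"
        f"Write {k} paraphrases that keep the same label, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [l.strip() for l in lines if l.strip()][:k]
```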
Related papers
- Fine-tuning can Help Detect Pretraining Data from Large Language Models [7.7209640786782385]
Current methods differentiate members and non-members by designing scoring functions, like Perplexity and Min-k%.
We introduce a novel and effective method termed Fine-tuned Score Deviation (FSD), which improves the performance of current scoring functions for pretraining data detection.
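A minimal sketch of the score-deviation idea, assuming perplexity as the scoring function; the model names and the fine-tuned checkpoint path are placeholders, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
# Hypothetical checkpoint: the same model after fine-tuning on a small
# amount of data known NOT to be in the pretraining corpus.
tuned = AutoModelForCausalLM.from_pretrained("./gpt2-finetuned-on-unseen")

@torch.no_grad()
def log_ppl(model, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()  # mean token NLL

def fsd_score(text: str) -> float:
    # Non-members tend to drop sharply in perplexity after fine-tuning,
    # members much less, so the deviation separates the two groups.
    return log_ppl(base, text) - log_ppl(tuned, text)
```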
arXiv Detail & Related papers (2024-10-09T15:36:42Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the reasoning skills needed for the intended downstream application.
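A rough sketch of gradient-similarity selection under simplifying assumptions (a toy model, a single random projection, per-example gradients); the paper's method is more involved, e.g. it accounts for the optimizer state.

```python
import torch
import torch.nn.functional as F

def grad_feature(model, loss_fn, x, y, proj):
    """Low-rank random projection of one example's gradient."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    return proj @ g

def select_data(model, loss_fn, candidates, targets, dim=512, top_k=100):
    n_params = sum(p.numel() for p in model.parameters())
    proj = torch.randn(dim, n_params) / dim ** 0.5
    # Average projected gradient of a few target-task examples.
    t = torch.stack([grad_feature(model, loss_fn, x, y, proj)
                     for x, y in targets]).mean(0)
    scores = torch.stack([F.cosine_similarity(
        grad_feature(model, loss_fn, x, y, proj), t, dim=0)
        for x, y in candidates])
    return scores.topk(min(top_k, len(candidates))).indices
```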
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
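As one plausible instance of an alignment-derived auxiliary task (the tag scheme below is our assumption, not necessarily the paper's), token-level edit tags can be read off the source/corrected alignment and trained as an extra sequence target:

```python
import difflib

def edit_tags(src: list[str], cor: list[str]) -> list[str]:
    """One per-source-token edit tag derived from the sentence alignment."""
    tags = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=cor).get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "delete":
            tags += ["DELETE"] * (i2 - i1)
        elif op == "replace":
            tags += ["REPLACE"] * (i2 - i1)
        else:  # "insert": mark the preceding token as an insertion point
            tags[-1:] = [tags[-1] + "+INSERT"] if tags else ["INSERT"]
    return tags

# edit_tags("She go to school .".split(), "She goes to school .".split())
# -> ['KEEP', 'REPLACE', 'KEEP', 'KEEP', 'KEEP']
```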
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Domain Generalization by Rejecting Extreme Augmentations [13.114457707388283]
We show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance.
We propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength of transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training.
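A minimal sketch of the three-step procedure; the reward used here (a frozen reference model's confidence in the clean label) is a stand-in assumption, not the paper's exact reward function, and each transform is assumed to accept a strength parameter.

```python
import random
import torch

def augment_batch(x, y, transforms, ref_model, strength=1.5, min_reward=0.3):
    t = random.choice(transforms)        # (i) uniform sampling over transforms
    x_aug = t(x, strength=strength)      # (ii) increased transformation strength
    with torch.no_grad():                # (iii) reward-based rejection
        probs = ref_model(x_aug).softmax(dim=-1)
        reward = probs.gather(1, y.unsqueeze(1)).mean()
    return (x_aug, y) if reward >= min_reward else (x, y)  # reject extremes
```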
arXiv Detail & Related papers (2023-10-10T14:46:22Z) - AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
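As a concrete instance of the DD objective, here is a minimal gradient-matching sketch, one family of methods covered by such reviews; the model, image shape, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill(real_loader, n_classes=10, ipc=10, steps=200, lr_img=0.1):
    """Learn `ipc` synthetic MNIST-sized images per class whose training
    gradients mimic those of real data (gradient-matching family of DD)."""
    syn_x = torch.randn(n_classes * ipc, 1, 28, 28, requires_grad=True)
    syn_y = torch.arange(n_classes).repeat_interleave(ipc)
    opt = torch.optim.SGD([syn_x], lr=lr_img)
    for _ in range(steps):
        # Fresh random network each step so the images generalize across inits.
        net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                            nn.ReLU(), nn.Linear(128, n_classes))
        real_x, real_y = next(iter(real_loader))  # one real batch (sketch only)
        g_real = torch.autograd.grad(
            F.cross_entropy(net(real_x), real_y), net.parameters())
        g_syn = torch.autograd.grad(
            F.cross_entropy(net(syn_x), syn_y), net.parameters(),
            create_graph=True)
        # Layer-wise cosine distance between real and synthetic gradients.
        loss = sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
                   for a, b in zip(g_real, g_syn))
        opt.zero_grad(); loss.backward(); opt.step()
    return syn_x.detach(), syn_y
```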
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Augmentation-Aware Self-Supervision for Data-Efficient GAN Training [68.81471633374393]
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting.
We propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data.
We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures.
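A minimal sketch of such an augmentation-aware discriminator: a shared trunk with an adversarial head plus a head that regresses the augmentation parameters. Layer sizes and the parameter count are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class AugAwareDiscriminator(nn.Module):
    def __init__(self, n_aug_params: int = 4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.adv_head = nn.Linear(128, 1)             # real vs. fake
        self.aug_head = nn.Linear(128, n_aug_params)  # self-supervised target

    def forward(self, x):
        h = self.trunk(x)
        return self.adv_head(h), self.aug_head(h)

# Training would add a regression loss between aug_head's output and the
# parameters of the augmentation actually applied, on top of the usual
# adversarial loss.
```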
arXiv Detail & Related papers (2022-05-31T10:35:55Z) - FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning [27.871007011425775]
We propose a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data.
Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness: it substantially improves many tasks while not negatively affecting the others.
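A minimal sketch of the generate-then-filter loop, with the generator and classifier left as hypothetical callables; the confidence threshold is an illustrative assumption.

```python
def flipda(example, flipped_label, generator, classifier,
           n_candidates=10, min_conf=0.9):
    """Keep generated candidates only if the classifier assigns them the
    flipped label with high confidence."""
    kept = []
    for _ in range(n_candidates):
        cand = generator(example, target_label=flipped_label)  # e.g. infilling
        probs = classifier(cand)                               # label -> prob
        if probs.get(flipped_label, 0.0) >= min_conf:
            kept.append((cand, flipped_label))
    return kept
```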
arXiv Detail & Related papers (2021-08-13T17:51:31Z) - Data Weighted Training Strategies for Grammatical Error Correction [8.370770440898454]
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
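A minimal sketch of how such example scoring could drive a training schedule; `log_ppl` is a hypothetical method on the model objects, and the highest-score-first ordering is our assumption, not necessarily the paper's schedule.

```python
def delta_log_ppl(example, base_model, tuned_model):
    """Gap between a general model's and an in-domain model's log-perplexity;
    higher values suggest cleaner, more relevant training pairs."""
    return base_model.log_ppl(example) - tuned_model.log_ppl(example)

def curriculum(dataset, base_model, tuned_model):
    # Order examples so the most trusted-looking data dominates training.
    return sorted(dataset,
                  key=lambda ex: delta_log_ppl(ex, base_model, tuned_model),
                  reverse=True)
```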
arXiv Detail & Related papers (2020-08-07T03:30:14Z) - Complex Wavelet SSIM based Image Data Augmentation [0.0]
We look at the MNIST handwritten digit dataset, an image dataset used for digit recognition.
We take a detailed look into one of the most popular augmentation techniques used for this dataset: elastic deformation.
We propose to use a similarity measure called Complex Wavelet Structural Similarity Index Measure (CWSSIM) to selectively filter out the irrelevant data.
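A minimal sketch of the filtering step; plain SSIM from scikit-image stands in for the paper's complex-wavelet variant (CW-SSIM), and the threshold is an illustrative assumption.

```python
from skimage.metrics import structural_similarity as ssim

def filter_augmented(originals, augmented, threshold=0.4):
    """Keep a deformed digit only if it stays close enough to its source.
    Inputs are pairs of 2-D grayscale arrays."""
    kept = []
    for orig, aug in zip(originals, augmented):
        score = ssim(orig, aug, data_range=float(orig.max() - orig.min()))
        if score >= threshold:  # discard over-distorted, irrelevant samples
            kept.append(aug)
    return kept
```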
arXiv Detail & Related papers (2020-07-11T21:11:46Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on this enlarged dataset, we propose applying a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.