Data Augmentation Approaches in Natural Language Processing: A Survey
- URL: http://arxiv.org/abs/2110.01852v1
- Date: Tue, 5 Oct 2021 07:35:32 GMT
- Title: Data Augmentation Approaches in Natural Language Processing: A Survey
- Authors: Bohan Li, Yutai Hou, Wanxiang Che
- Abstract summary: Data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail.
One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data.
We frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling.
- Score: 28.91744006146676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As an effective strategy, data augmentation (DA) alleviates data scarcity
scenarios where deep learning techniques may fail. It has been widely applied
in computer vision and was later introduced to natural language processing,
where it achieves improvements in many tasks. One of the main focuses of DA
methods is to
improve the diversity of training data, thereby helping the model to better
generalize to unseen testing data. In this survey, we frame DA methods into
three categories based on the diversity of augmented data, including
paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods
in detail according to the above categories. Further, we also introduce their
applications in NLP tasks as well as the challenges.
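As a minimal illustration of the noising category described above, the sketch below applies two classic token-level perturbations (random deletion and random swap) using only the Python standard library. The function names, probabilities, and example sentence are illustrative, not taken from the survey.

```python
import random

def random_deletion(tokens, p=0.1, rng=None):
    """Noising: drop each token with probability p (always keep at least one)."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n=1, rng=None):
    """Noising: swap n randomly chosen pairs of token positions."""
    rng = rng or random.Random(0)
    out = list(tokens)
    for _ in range(n):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

sent = "data augmentation improves generalization on unseen data".split()
print(random_deletion(sent))
print(random_swap(sent))
```

Such perturbations add diversity cheaply but, unlike paraphrasing, do not guarantee that the augmented sentence remains fluent or label-preserving.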
Related papers
- Generalized Group Data Attribution [28.056149996461286]
Data Attribution methods quantify the influence of individual training data points on model outputs.
Existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models.
We introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones.
arXiv Detail & Related papers (2024-10-13T17:51:21Z)
- Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models [33.488331159912136]
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference.
Data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning.
We present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs.
arXiv Detail & Related papers (2024-08-04T16:50:07Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys only focus on a certain type of specific modality data.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges [47.45993726498343]
Data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection.
This survey explores the transformative impact of large language models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond.
arXiv Detail & Related papers (2024-03-05T14:11:54Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the reasoning skills needed for the intended downstream application.
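The core selection step in a gradient-similarity approach can be sketched as ranking training examples by the cosine similarity between their (dimensionality-reduced) gradient features and those of a target-task example. The feature vectors and names below are invented for illustration; they are not from the LESS paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical low-dimensional gradient features (e.g. after random projection).
train_feats = {"ex1": [0.9, 0.1], "ex2": [0.1, 0.95], "ex3": [0.7, 0.6]}
target_feat = [1.0, 0.0]  # gradient feature of a target-task example

# Rank training examples by gradient alignment and keep the top fraction.
ranked = sorted(train_feats,
                key=lambda k: cosine(train_feats[k], target_feat),
                reverse=True)
print(ranked[0])  # prints "ex1": the example most gradient-aligned with the target
```

Selecting the top-ranked fraction (e.g. 5%, as in the summary above) then yields the targeted instruction-tuning subset.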
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Rethink the Effectiveness of Text Data Augmentation: An Empirical Analysis [4.771833920251869]
We evaluate the effectiveness of three different FT methods in conjunction with back-translation across an array of 7 diverse NLP tasks.
Our findings reveal that continued pre-training on augmented data can effectively improve the FT performance of the downstream tasks.
Our finding highlights the potential of DA as a powerful tool for bolstering LMs' performance.
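Back-translation, the augmentation technique evaluated above, paraphrases a sentence by translating it into a pivot language and back. The sketch below shows only the control flow; the dictionary-based "translators" are toy stand-ins for real MT systems and are invented for illustration.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Back-translation DA: the round trip through a pivot language
    yields a paraphrase usable as an augmented training example."""
    return from_pivot(to_pivot(sentence))

# Toy stand-in translators for illustration only (a real pipeline
# would call trained MT models here).
en_to_de = {"the model performs well": "das Modell funktioniert gut"}
de_to_en = {"das Modell funktioniert gut": "the model works well"}

aug = back_translate("the model performs well", en_to_de.get, de_to_en.get)
print(aug)  # prints "the model works well"
```

The round trip preserves meaning while varying surface form, which is exactly the diversity the survey's paraphrasing category targets.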
arXiv Detail & Related papers (2023-06-13T10:14:58Z)
- Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data [100.33096338195723]
We focus on Few-shot Learning with Auxiliary Data (FLAD)
FLAD assumes access to auxiliary data during few-shot learning in hopes of improving generalization.
We propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with prior FLAD methods that either explore or exploit.
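EXP3 is a standard adversarial multi-armed bandit algorithm, here applicable to choosing which auxiliary dataset to sample from. The sketch below implements textbook EXP3 (exponential weights with uniform exploration and importance-weighted updates); the reward values and loop are a toy scenario, not the FLAD experiments.

```python
import math
import random

def exp3_sample(weights, gamma, rng):
    """EXP3: mix the exponential-weights distribution with uniform
    exploration (rate gamma); return the sampled arm and its probability."""
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, p
    return k - 1, probs[-1]

def exp3_update(weights, arm, prob, reward, gamma):
    """Importance-weighted exponential update for the pulled arm."""
    k = len(weights)
    weights[arm] *= math.exp(gamma * reward / (prob * k))

# Toy run: only "arm" 1 (one auxiliary dataset) yields reward,
# so its weight should come to dominate.
rng = random.Random(0)
w = [1.0, 1.0, 1.0]
for _ in range(200):
    arm, p = exp3_sample(w, 0.1, rng)
    reward = 1.0 if arm == 1 else 0.0
    exp3_update(w, arm, p, reward, 0.1)
print(max(range(3), key=lambda i: w[i]))  # prints 1
```

UCB1, the other algorithm named above, replaces this randomized scheme with a deterministic optimism-under-uncertainty rule.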
arXiv Detail & Related papers (2023-02-01T18:59:36Z)
- Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition [17.892385961143173]
We propose a new method to transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes.
We design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data.
Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.
arXiv Detail & Related papers (2022-10-14T16:02:03Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
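Linearizing a labeled sentence, as described above, can be done by interleaving each non-O tag before its token, so that an ordinary language model can generate words and labels jointly. This is a hedged sketch of one common linearization scheme; the example sentence and tags are invented.

```python
def linearize(tokens, tags):
    """Interleave non-O tags before their tokens so a plain language
    model can be trained on the flattened word/label sequence."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(tok)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# prints "B-PER John lives in B-LOC Paris"
```

Sampling from a language model trained on such sequences, then inverting the linearization, yields new labeled sentences for the tagging task.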
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.