An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP
- URL: http://arxiv.org/abs/2106.07499v1
- Date: Mon, 14 Jun 2021 15:27:22 GMT
- Title: An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP
- Authors: Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal and Diyi Yang
- Abstract summary: The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
- Score: 88.65488361532158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP has achieved great progress in the past decade through the use of neural
models and large labeled datasets. The dependence on abundant data prevents NLP
models from being applied to low-resource settings or novel tasks where
significant time, money, or expertise is required to label massive amounts of
textual data. Recently, data augmentation methods have been explored as a means
of improving data efficiency in NLP. To date, there has been no systematic
empirical overview of data augmentation for NLP in the limited labeled data
setting, making it difficult to understand which methods work in which
settings. In this paper, we provide an empirical survey of recent progress on
data augmentation for NLP in the limited labeled data setting, summarizing the
landscape of methods (including token-level augmentations, sentence-level
augmentations, adversarial augmentations, and hidden-space augmentations) and
carrying out experiments on 11 datasets covering topics/news classification,
inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the
results, we draw several conclusions to help practitioners choose appropriate
augmentations in different settings and discuss the current challenges and
future directions for limited data learning in NLP.
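As a concrete illustration of the token-level augmentation family named in the abstract, the sketch below implements two common label-preserving operations, random token swap and random token deletion (in the spirit of EDA-style augmentation). The function names and probability values are illustrative assumptions, not details taken from the paper.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Return a copy of `tokens` with two randomly chosen positions swapped, n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p_delete=0.1):
    """Drop each token independently with probability p_delete, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p_delete]
    return kept if kept else [random.choice(tokens)]

# Example: generate two augmented views of a training sentence; both keep the original label.
sentence = "data augmentation can improve data efficiency in low resource settings".split()
print(random_swap(sentence, n_swaps=2))
print(random_deletion(sentence, p_delete=0.1))
```

In the limited-data setting, each original example is typically expanded into a few such augmented copies that share the original label.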
Related papers
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences [16.66079305798581]
Transformers have given birth to pre-trained large language models (PLMs).
The need to have quality datasets has prompted NLP researchers to continue creating new datasets to satisfy particular needs.
This work aims to uncover the trends and insights mined within these datasets.
arXiv Detail & Related papers (2024-03-31T15:13:15Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Surveying the Landscape of Text Summarization with Deep Learning: A Comprehensive Review [2.4185510826808487]
Deep learning has revolutionized natural language processing (NLP) by enabling the development of models that can learn complex representations of language data.
Deep learning models for NLP typically use large amounts of data to train deep neural networks, allowing them to learn the patterns and relationships in language data.
Applying deep learning to text summarization refers to the use of deep neural networks to perform text summarization tasks.
arXiv Detail & Related papers (2023-10-13T21:24:37Z)
- Efficient Methods for Natural Language Processing: A Survey [76.34572727185896]
This survey synthesizes and relates current methods and findings in efficient NLP.
We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
arXiv Detail & Related papers (2022-08-31T20:32:35Z)
- KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Few-Shot NLP [68.43279384561352]
Existing data augmentation algorithms leverage task-independent rules or fine-tune general-purpose pre-trained language models.
These methods carry only trivial task-specific knowledge and are thus limited to yielding low-quality synthetic data that helps only weak baselines on simple tasks.
We propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pretrained on a mixture of diverse NLP tasks.
arXiv Detail & Related papers (2022-06-21T11:34:02Z)
- A Survey of Data Augmentation Approaches for NLP [12.606206831969262]
Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks.
Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data.
We present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner.
arXiv Detail & Related papers (2021-05-07T06:03:45Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method that uses language models trained on linearized labeled sentences, as sketched below.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
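To make DAGA's generation-based approach more concrete, the sketch below shows one plausible way to linearize a BIO-tagged sentence (inserting each non-O tag before its token) so that an ordinary language model can be trained on, and later sample, labeled sequences. The tag format, function names, and de-linearization rule are illustrative assumptions rather than the paper's exact scheme.

```python
def linearize(tokens, tags, outside_tag="O"):
    """Insert each non-O tag directly before its token, producing a plain token sequence."""
    seq = []
    for token, tag in zip(tokens, tags):
        if tag != outside_tag:
            seq.append(tag)
        seq.append(token)
    return " ".join(seq)

def delinearize(sequence, tag_prefixes=("B-", "I-"), outside_tag="O"):
    """Recover (tokens, tags) from a generated linearized sequence."""
    tokens, tags = [], []
    pending = outside_tag
    for piece in sequence.split():
        if piece.startswith(tag_prefixes):
            pending = piece  # the tag applies to the next token
        else:
            tokens.append(piece)
            tags.append(pending)
            pending = outside_tag
    return tokens, tags

# Example round trip on a toy NER sentence.
toks = ["John", "lives", "in", "London", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O"]
lin = linearize(toks, tags)
print(lin)               # "B-PER John lives in B-LOC London ."
print(delinearize(lin))  # recovers the original (tokens, tags)
```

A language model fine-tuned on such sequences can then be sampled to produce new sentences that are de-linearized back into synthetic labeled examples.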