Data Augmentation for Neural NLP
- URL: http://arxiv.org/abs/2302.11412v1
- Date: Wed, 22 Feb 2023 14:47:15 GMT
- Title: Data Augmentation for Neural NLP
- Authors: Domagoj Pluščec, Jan Šnajder
- Abstract summary: Data augmentation is a low-cost approach for tackling data scarcity.
This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data scarcity is a problem that arises for languages and tasks where we do not have large amounts of labeled data but want to use state-of-the-art models. Such models are often deep learning models that require significant amounts of data to train. Moreover, acquiring labeled data for machine learning problems typically incurs high labeling costs. Data augmentation is a low-cost approach to tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research.
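As context for the kinds of methods such a survey covers, below is a minimal sketch of one of the simplest rule-based text augmentation techniques, WordNet-based synonym replacement (in the spirit of Easy Data Augmentation). This illustration is ours, not code from the paper, and assumes NLTK with the WordNet corpus downloaded.

```python
import random

from nltk.corpus import wordnet  # requires: import nltk; nltk.download("wordnet")


def synonym_replace(sentence: str, n_replacements: int = 2, seed: int = 0) -> str:
    """Return a variant of `sentence` with up to `n_replacements` words
    swapped for a randomly chosen WordNet synonym."""
    rng = random.Random(seed)
    tokens = sentence.split()
    positions = list(range(len(tokens)))
    rng.shuffle(positions)  # try words in random order

    replaced = 0
    for pos in positions:
        if replaced >= n_replacements:
            break
        # Collect synonyms (multiword lemmas joined with spaces) that
        # differ from the original token.
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(tokens[pos])
            for lemma in synset.lemmas()
            if lemma.name().lower() != tokens[pos].lower()
        }
        if synonyms:
            tokens[pos] = rng.choice(sorted(synonyms))
            replaced += 1
    return " ".join(tokens)


print(synonym_replace("the quick brown fox jumps over the lazy dog"))
```

Techniques like this trade a small risk of semantic drift for cheap label-preserving variety; the survey's emphasis on neural and transformer-based methods reflects more recent alternatives that generate augmented text with pretrained models instead.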
Related papers
- Unsupervised Data Validation Methods for Efficient Model Training
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and vision-language modeling (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data, and improving accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z)
- Leveraging Data Augmentation for Process Information Extraction
We investigate the application of data augmentation to natural language text data.
Data augmentation is an important component in enabling machine learning methods for generating business process models from natural language text.
arXiv Detail & Related papers (2024-04-11T06:32:03Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models
Large language models (LLMs) are foundational to AI advancements, but they pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- A survey of synthetic data augmentation methods in computer vision
This paper presents an extensive review of synthetic data augmentation techniques.
We focus on the important data generation and augmentation techniques, their general scope of application, and specific use cases.
We also provide a summary of common synthetic datasets for training computer vision models.
arXiv Detail & Related papers (2024-03-15T07:34:08Z)
- Making Large Language Models Better Data Creators
Large language models (LLMs) have significantly advanced the state of the art in NLP, but deploying them for downstream applications remains challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single format example.
arXiv Detail & Related papers (2023-10-31T01:08:34Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Automatic Data Augmentation via Invariance-Constrained Learning
Underlying data structures are often exploited to improve the solution of learning tasks.
Data augmentation induces these symmetries during training by applying multiple transformations to the input data.
This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
- Deep invariant networks with differentiable augmentation layers
Methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems.
We show that our approach is easier and faster to train than modern automatic data augmentation techniques.
arXiv Detail & Related papers (2022-02-04T14:12:31Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows us to reduce the number of required experiments.
The difference in the linguistic complexity of the datasets also allows us to discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Data Augmentation for Deep Candlestick Learner
We propose a Modified Local Search Attack Sampling method to augment candlestick data.
Our results show that the proposed method can generate high-quality data that are hard for humans to distinguish.
arXiv Detail & Related papers (2020-05-14T06:02:31Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method that uses pretrained language models to boost the variability and accuracy of generated utterances (a minimal sketch of this general idea follows the list).
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
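As flagged in the last entry, here is a minimal sketch of the general idea behind LM-based utterance augmentation: mask words in a training utterance and let a pretrained masked language model propose in-context replacements. This is an illustration of the pattern, not the cited paper's exact method (which fine-tunes a generative model), and it assumes the Hugging Face transformers library with the roberta-base checkpoint.

```python
# Illustrative sketch: masked-LM word substitution for utterance
# augmentation. Not the cited SLU paper's actual pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

utterance = "I want to book a flight to Boston tomorrow"
tokens = utterance.split()

augmented = set()
for i in range(len(tokens)):
    # Mask one word at a time and collect the model's top completions.
    masked = " ".join(tokens[:i] + ["<mask>"] + tokens[i + 1:])
    for candidate in fill_mask(masked, top_k=3):
        variant = candidate["sequence"].strip()
        if variant != utterance:
            augmented.add(variant)

for variant in sorted(augmented):
    print(variant)
```

In practice one would filter the variants (e.g., by model score or by keeping slot labels intact) so that the augmented utterances preserve the original intent.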
This list is automatically generated from the titles and abstracts of the papers on this site.