DAGAM: Data Augmentation with Generation And Modification
- URL: http://arxiv.org/abs/2204.02633v1
- Date: Wed, 6 Apr 2022 07:20:45 GMT
- Title: DAGAM: Data Augmentation with Generation And Modification
- Authors: Byeong-Cheol Jo, Tak-Sung Heo, Yeongjoon Park, Yongmin Yoo, Won Ik Cho, Kyungsun Kim
- Abstract summary: In pre-trained language models, under-fitting often occurs because the model is very large compared to the amount of available training data.
We introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models.
- Score: 3.063234089519162
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classification is a representative downstream task of natural language
processing, and has exhibited excellent performance since the advent of
pre-trained language models based on Transformer architecture. However, in
pre-trained language models, under-fitting often occurs due to the size of the
model being very large compared to the amount of available training data. Given
the importance of data collection in the modern machine learning paradigm,
natural language data augmentation has been actively studied. In light of this,
we introduce three data augmentation schemes that help reduce the under-fitting
problems of large-scale language models. First, we use a generation model for
data augmentation, which we define as Data Augmentation with Generation (DAG).
Next, we augment data using text
modification techniques such as corruption and word order change (Data
Augmentation with Modification, DAM). Finally, we propose Data Augmentation
with Generation And Modification (DAGAM), which combines DAG and DAM techniques
for boosted performance. We conduct data augmentation on six benchmark
datasets for text classification, and verify the usefulness of DAG, DAM,
and DAGAM through BERT-based fine-tuning and evaluation, obtaining better
results than with the original datasets.
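The abstract names the three schemes but does not spell out their procedures. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: "corruption" is assumed to mean random character deletion, "word order change" is assumed to mean shuffling the words of a sentence, the generation model is represented by a hypothetical `generate` callable supplied by the caller, and DAGAM is assumed to be a simple union of the original, generated, and modified examples. The function names and the `rate` and `seed` parameters are illustrative only.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def corrupt(text: str, rate: float, rng: random.Random) -> str:
    # ASSUMPTION: corruption = drop each non-space character with probability `rate`.
    return "".join(ch for ch in text if ch == " " or rng.random() >= rate)


def change_word_order(text: str, rng: random.Random) -> str:
    # ASSUMPTION: word order change = shuffle the whitespace-separated words.
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def dam(data: Sequence[Example], rate: float = 0.1, seed: int = 0) -> List[Example]:
    # Modification-based augmentation: two modified copies per example, label kept.
    rng = random.Random(seed)
    augmented: List[Example] = []
    for text, label in data:
        augmented.append((corrupt(text, rate, rng), label))
        augmented.append((change_word_order(text, rng), label))
    return augmented


def dag(data: Sequence[Example], generate: Callable[[str], str]) -> List[Example]:
    # Generation-based augmentation: `generate` is a hypothetical stand-in for
    # a text generation model; the original label is reused for its output.
    return [(generate(text), label) for text, label in data]


def dagam(
    data: Sequence[Example],
    generate: Callable[[str], str],
    rate: float = 0.1,
    seed: int = 0,
) -> List[Example]:
    # ASSUMPTION: the combined scheme is the union of original, DAG, and DAM data.
    return list(data) + dag(data, generate) + dam(data, rate, seed)


if __name__ == "__main__":
    train = [("the movie was great", "pos"), ("the plot was dull", "neg")]
    # Upper-casing is only a placeholder for a real generation model.
    for text, label in dagam(train, generate=lambda t: t.upper(), rate=0.2, seed=1):
        print(label, "|", text)
```

The augmented list would then be used in place of the original training set when fine-tuning a classifier such as BERT. Reusing the source label for generated and modified text is itself an assumption that holds only when the transformation preserves the class.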
Related papers
- Unifying Structured Data as Graph for Data-to-Text Pre-Training [69.96195162337793]
Data-to-text (D2T) generation aims to transform structured data into natural language text.
Data-to-text pre-training has proved to be powerful in enhancing D2T generation.
We propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer.
(2024-01-02T12:23:49Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
(2023-11-25T02:45:50Z)
- Investigating Masking-based Data Generation in Language Models [0.0]
A feature of BERT and models with similar architecture is the objective of masked language modeling.
Data augmentation is a data-driven technique widely used in machine learning.
Recent studies have utilized masked language models to generate artificially augmented data for NLP downstream tasks.
(2023-06-16T16:48:27Z)
- Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
(2023-05-25T17:43:05Z)
- ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that REGEN achieves 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models.
(2023-05-18T04:30:09Z)
- Efficient Training of Language Models to Fill in the Middle [17.118891860985123]
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset.
We use these ablations to prescribe strong default settings and best practices to train FIM models.
We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
(2022-07-28T17:40:47Z)
- N-Grammer: Augmenting Transformers with latent n-grams [35.39961549040385]
We propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence.
We evaluate our model, the N-Grammer, on language modeling on the C4 data-set as well as text classification on the SuperGLUE data-set, and find that it outperforms several strong baselines such as the Transformer and the Primer.
(2022-07-13T17:18:02Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
(2022-06-06T16:11:58Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
(2021-11-02T15:48:28Z)
- Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement in performance on both datasets, even with only a small number of additionally generated data points.
(2021-01-13T09:55:29Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
(2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.