A Survey on Data Augmentation in Large Model Era
- URL: http://arxiv.org/abs/2401.15422v2
- Date: Mon, 4 Mar 2024 16:55:15 GMT
- Title: A Survey on Data Augmentation in Large Model Era
- Authors: Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, and Yuan Wu
- Abstract summary: Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence.
With continuous updates to these models, the existing reservoir of high-quality data may soon be depleted.
This paper offers an exhaustive review of large model-driven data augmentation methods.
- Score: 16.05117556207015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large models, encompassing large language and diffusion models, have shown
exceptional promise in approximating human-level intelligence, garnering
significant interest from both academic and industrial spheres. However, the
training of these large models necessitates vast quantities of high-quality
data, and with continuous updates to these models, the existing reservoir of
high-quality data may soon be depleted. This challenge has catalyzed a surge in
research focused on data augmentation methods. Leveraging large models, these
data augmentation techniques have outperformed traditional approaches. This
paper offers an exhaustive review of large model-driven data augmentation
methods, adopting a comprehensive perspective. We begin by establishing a
classification of relevant studies into three main categories: image
augmentation, text augmentation, and paired data augmentation. Following this,
we delve into various data post-processing techniques pertinent to large
model-based data augmentation. Our discussion then expands to encompass the
array of applications for these data augmentation methods within natural
language processing, computer vision, and audio signal processing. We proceed
to evaluate the successes and limitations of large model-based data
augmentation across different scenarios. Concluding our review, we highlight
prospective challenges and avenues for future exploration in the field of data
augmentation. Our objective is to furnish researchers with critical insights,
ultimately contributing to the advancement of more sophisticated large models.
We consistently maintain the related open-source materials at:
https://github.com/MLGroup-JLU/LLM-data-aug-survey.
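The survey's core pattern, prompting a large model to produce label-preserving variants of existing samples, can be sketched minimally as follows. The model call is stubbed out here; in practice it would query an LLM API. All function names are illustrative, not taken from the surveyed papers.

```python
# Minimal sketch of large-model-driven text augmentation via prompted
# paraphrasing. `fake_llm` stands in for a real LLM call.

def build_paraphrase_prompt(text: str, n_variants: int) -> str:
    """Compose an instruction asking a large model for label-preserving rewrites."""
    return (
        f"Rewrite the following sentence in {n_variants} different ways, "
        f"preserving its meaning and label:\n{text}"
    )

def fake_llm(prompt: str) -> list:
    """Stand-in for a real LLM call; returns trivial surface variants."""
    original = prompt.rsplit("\n", 1)[-1]
    return [original.lower(), original.upper()]

def augment_text(sample: str, n_variants: int = 2) -> list:
    """Return the original sample plus model-generated variants."""
    prompt = build_paraphrase_prompt(sample, n_variants)
    variants = fake_llm(prompt)
    return [sample] + variants

print(augment_text("Data augmentation expands small corpora."))
```

Swapping `fake_llm` for a real model turns this into the text-augmentation branch of the survey's taxonomy; image and paired-data augmentation follow the same prompt-then-filter pattern with different generators.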
Related papers
- Training Data for Large Language Model [2.1178416840822027]
ChatGPT surpassed previous models in terms of parameters and the scale of its pretraining corpus.
ChatGPT achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data.
This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models.
arXiv Detail & Related papers (2024-11-12T11:09:58Z)
- A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
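The idea behind background augmentation can be sketched in a few lines: keep the foreground object (given by a binary mask) and swap the background for a newly generated one. Here the "generated" background is a constant toy image; in the paper it would come from a diffusion model. All names are illustrative.

```python
# Sketch of background augmentation for object detection: masked
# foreground pixels are pasted onto a replacement background.

def composite(foreground, mask, background):
    """Paste masked foreground pixels onto a background of the same size."""
    h, w = len(foreground), len(foreground[0])
    return [
        [foreground[y][x] if mask[y][x] else background[y][x] for x in range(w)]
        for y in range(h)
    ]

# 3x3 toy image: object pixels are 9, background pixels are 0.
fg = [[9, 9, 0], [9, 9, 0], [0, 0, 0]]
mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
new_bg = [[5] * 3 for _ in range(3)]  # stand-in for a diffusion sample

print(composite(fg, mask, new_bg))
# Object pixels survive; every background pixel becomes 5.
```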
arXiv Detail & Related papers (2024-08-01T07:40:00Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
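The two categories can be contrasted in a toy sketch: data perturbation applies label-preserving transforms to existing samples (here, a horizontal flip of a tiny "image"), while data generation synthesizes new samples from scratch (here, a stubbed random generator in place of a GAN or diffusion model). Names are illustrative.

```python
# Sketch contrasting data perturbation vs. data generation.
import random

def perturb(image):
    """Label-preserving perturbation: horizontal flip."""
    return [row[::-1] for row in image]

def generate(n, size=2, seed=0):
    """Stand-in generator; a real one would be a GAN or diffusion model."""
    rng = random.Random(seed)
    return [[[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
            for _ in range(n)]

sample = [[1, 2], [3, 4]]
print(perturb(sample))   # [[2, 1], [4, 3]]
print(len(generate(3)))  # 3
```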
arXiv Detail & Related papers (2024-03-13T16:05:18Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Comprehensive Exploration of Synthetic Data Generation: A Survey [4.485401662312072]
This work surveys 417 Synthetic Data Generation models over the last decade.
The findings reveal increased model performance and complexity, with neural network-based approaches prevailing.
Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete.
arXiv Detail & Related papers (2024-01-04T20:23:51Z) - Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z) - Revisiting Data Augmentation in Model Compression: An Empirical and
Comprehensive Study [17.970216875558638]
In this paper, we revisit the usage of data augmentation in model compression.
We show that models in different sizes prefer data augmentation with different magnitudes.
The prediction of a pre-trained large model can be utilized to measure the difficulty of data augmentation.
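One simple way to realize this idea is to score an augmented sample by the teacher model's confidence on the true label: the lower the confidence, the harder the augmentation. The sketch below is a toy stand-in for that measurement, not the paper's exact procedure.

```python
# Sketch: use a pre-trained "teacher" model's prediction to score how
# difficult an augmented sample is. Lower true-class probability = harder.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def augmentation_difficulty(teacher_logits, true_label):
    """1 - teacher confidence on the true class; lies in [0, 1]."""
    return 1.0 - softmax(teacher_logits)[true_label]

easy = augmentation_difficulty([5.0, 0.0, 0.0], true_label=0)
hard = augmentation_difficulty([0.0, 0.0, 5.0], true_label=0)
print(easy < hard)  # a mild augmentation should score as easier
```

Under this scoring, smaller compressed models would be paired with augmentations of lower measured difficulty, matching the paper's observation that model size and augmentation magnitude should co-vary.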
arXiv Detail & Related papers (2023-05-22T17:05:06Z) - Phased Data Augmentation for Training a Likelihood-Based Generative Model with Limited Data [0.0]
Generative models excel in creating realistic images, yet their dependency on extensive datasets for training presents significant challenges.
Current data-efficient methods largely focus on GAN architectures, leaving a gap in training other types of generative models.
"Phased data augmentation" is a novel technique that addresses this gap by optimizing training in limited-data scenarios without altering the inherent data distribution.
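A phased schedule can be implemented as a mapping from training epoch to the set of active transforms. The phase boundaries, transform names, and the direction of the schedule (widening here) are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a phased augmentation schedule: each phase enables a
# different set of transforms, keyed by the epoch at which it starts.

PHASES = [
    (0, ["identity"]),                                   # warm-up
    (10, ["identity", "flip"]),                          # mild transforms
    (20, ["identity", "flip", "crop", "color_jitter"]),  # full pipeline
]

def active_transforms(epoch):
    """Return the transforms enabled at a given epoch."""
    enabled = PHASES[0][1]
    for start_epoch, transforms in PHASES:
        if epoch >= start_epoch:
            enabled = transforms
    return enabled

print(active_transforms(5))   # ['identity']
print(active_transforms(25))  # ['identity', 'flip', 'crop', 'color_jitter']
```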
arXiv Detail & Related papers (2023-05-22T03:38:59Z) - Advanced Data Augmentation Approaches: A Comprehensive Survey and Future
directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks, such as image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z) - Exploring the Effects of Data Augmentation for Drivable Area
Segmentation [0.0]
We focus on investigating the benefits of data augmentation by analyzing pre-existing image datasets.
Our results show that the performance and robustness of existing state-of-the-art (SOTA) models can be increased dramatically.
arXiv Detail & Related papers (2022-08-06T03:39:37Z)