Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
- URL: http://arxiv.org/abs/2501.18845v1
- Date: Fri, 31 Jan 2025 01:50:49 GMT
- Title: Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities
- Authors: Yaping Chai, Haoran Xie, Joe S. Qin
- Abstract summary: Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities.
Recent retrieval-based techniques further improve the performance of LLMs in data augmentation by introducing external knowledge.
- Abstract: Pre-trained language models of increasing size and complexity have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training data can cause a model to overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, task-specific prompt templates guide LLMs in generating the required content. Recent retrieval-based techniques further improve the performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce better-grounded, more faithful data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We also summarise post-processing approaches, which contribute significantly to refining the augmented data and filtering out unfaithful content. We then outline common tasks and evaluation metrics. Finally, we discuss existing challenges and future opportunities that could bring further improvement to data augmentation.
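To make the prompt-based augmentation and post-processing stages described in the abstract concrete, below is a minimal, illustrative Python sketch. The `call_llm` helper, the prompt template, and the overlap-based filter are hypothetical placeholders rather than methods from the survey; a retrieval-based variant would additionally prepend retrieved passages to the prompt.

```python
# Minimal sketch of prompt-based augmentation with a post-processing filter.
# `call_llm` is a hypothetical stand-in for any LLM completion API; the prompt
# template, seed example, and similarity filter are illustrative only.
from difflib import SequenceMatcher

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API (e.g. a chat-completion call)."""
    raise NotImplementedError("plug in your model provider here")

PROMPT_TEMPLATE = (
    "You are helping to augment a sentiment classification dataset.\n"
    "Rewrite the example below so that it keeps the label '{label}' "
    "but uses different wording:\n\n{text}"
)

def augment(example: dict, n: int = 3) -> list[dict]:
    """Generate up to n paraphrases of one labelled seed example."""
    candidates = []
    for _ in range(n):
        prompt = PROMPT_TEMPLATE.format(label=example["label"], text=example["text"])
        candidates.append({"text": call_llm(prompt), "label": example["label"]})
    return post_process(example, candidates)

def post_process(seed: dict, candidates: list[dict]) -> list[dict]:
    """Keep paraphrases that overlap with the seed text, but are neither
    near-duplicates nor completely unrelated generations."""
    kept = []
    for cand in candidates:
        overlap = SequenceMatcher(None, seed["text"], cand["text"]).ratio()
        if 0.2 < overlap < 0.9:
            kept.append(cand)
    return kept
```

In practice the simple overlap heuristic above is often replaced by a trained critic or NLI model that scores the faithfulness of each generation against the seed example before it is added to the training set.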
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs).
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
- Empowering Large Language Models for Textual Data Augmentation [23.483960932358396]
Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation.
This work proposes a new solution, which can automatically generate a large pool of augmentation instructions and select the most suitable task-informed instructions.
Empirically, the proposed approach consistently generates augmented data with better quality compared to non-LLM and LLM-based data augmentation methods.
arXiv Detail & Related papers (2024-04-26T18:04:25Z)
- Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges [47.45993726498343]
Data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection.
This survey explores the transformative impact of large language models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond.
arXiv Detail & Related papers (2024-03-05T14:11:54Z)
- LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
arXiv Detail & Related papers (2024-02-22T14:19:56Z)
- A Survey on Data Augmentation in Large Model Era [16.05117556207015]
Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence.
With continuous updates to these models, the existing reservoir of high-quality data may soon be depleted.
This paper offers an exhaustive review of large model-driven data augmentation methods.
arXiv Detail & Related papers (2024-01-27T14:19:33Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Phased Data Augmentation for Training a Likelihood-Based Generative Model with Limited Data [0.0]
Generative models excel in creating realistic images, yet their dependency on extensive datasets for training presents significant challenges.
Current data-efficient methods largely focus on GAN architectures, leaving a gap in training other types of generative models.
"phased data augmentation" is a novel technique that addresses this gap by optimizing training in limited data scenarios without altering the inherent data distribution.
arXiv Detail & Related papers (2023-05-22T03:38:59Z)