Related papers: Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

URL: http://arxiv.org/abs/2403.02990v4
Date: Tue, 2 Jul 2024 07:59:40 GMT
Title: Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges
Authors: Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, Shafiq Joty,
Abstract summary: Data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of large language models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond.
Score: 47.45993726498343
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.

Related papers

LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
From Selection to Generation: A Survey of LLM-based Active Learning [153.8110509961261]
Large Language Models (LLMs) have been employed for generating entirely new data instances and providing more cost-effective annotations. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques.
arXiv Detail & Related papers (2025-02-17T12:58:17Z)
Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data [36.277423093218275]
We study the role of data diversity in enhancing the overall abilities of large language models (LLMs) We propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data.
arXiv Detail & Related papers (2025-02-05T17:21:01Z)
Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities [3.1394848827666544]
Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation.
arXiv Detail & Related papers (2025-01-31T01:50:49Z)
The Evolution of LLM Adoption in Industry Data Curation Practices [20.143297690624298]
This paper explores the evolution of large language models (LLMs) among practitioners at a large technology company. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution.
arXiv Detail & Related papers (2024-12-20T17:34:16Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected. On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data. To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
A Survey of Multimodal Large Language Model from A Data-centric Perspective [46.57232264950785]
Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities. Data plays a pivotal role in the development and refinement of these models.
arXiv Detail & Related papers (2024-05-26T17:31:21Z)
Large Language Models for Data Annotation: A Survey [49.8318827245266]
The emergence of advanced Large Language Models (LLMs) presents an unprecedented opportunity to automate the complicated process of data annotation. This survey includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation.
arXiv Detail & Related papers (2024-02-21T00:44:04Z)
Large Language Models Meet Computer Vision: A Brief Survey [0.0]
Large Language Models (LLMs) and Computer Vision (CV) have emerged as a pivotal area of research, driving significant advancements in the field of Artificial Intelligence (AI) This survey paper delves into the latest progressions in the domain of transformers, emphasizing their potential to revolutionize Vision Transformers (ViTs) and LLMs. The survey is concluded by highlighting open directions in the field, suggesting potential venues for future research and development.
arXiv Detail & Related papers (2023-11-28T10:39:19Z)
How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model [12.890344377484759]
This review paper explores Multimodal Large Language Models (MLLMs) MLLMs integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and vision. Choosing the appropriate modality alignment method is crucial, as improper methods might require more parameters with limited performance improvement.
arXiv Detail & Related papers (2023-11-10T09:51:24Z)
Large Language Models as Data Preprocessors [9.99065004972981]
Large Language Models (LLMs) have marked a significant advancement in artificial intelligence. This study explores their potential in data preprocessing, a critical stage in data mining and analytics applications. We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques.
arXiv Detail & Related papers (2023-08-30T23:28:43Z)
Improving Small Language Models on PubMedQA via Generative Data Augmentation [4.96649519549027]
Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing. Small Language Models (SLMs) are known for their efficiency, but they often struggle with limited capacity and training data. We introduce a novel method aimed at improving SLMs in the medical domain using LLM-based generative data augmentation.
arXiv Detail & Related papers (2023-05-12T23:49:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.