Related papers: On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

URL: http://arxiv.org/abs/2406.15126v1
Date: Fri, 14 Jun 2024 07:47:09 GMT
Title: On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Authors: Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang,
Abstract summary: Large Language Models (LLMs) offer a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. This paper provides an organization of relevant studies based on a generic workflow of synthetic data generation.
Score: 26.670507323784616
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.

Related papers

Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al.<n>We critically examine where exactly synthetic data improves model generalization.<n>Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
More Parameters Than Populations: A Systematic Literature Review of Large Language Models within Survey Research [0.7699714865575188]
Large Language Models (LLMs) bring new technological challenges and prerequisites in order to fully harness their potential.<n>We report work-in-progress on a systematic literature review based on keyword searches from multiple large-scale databases.<n>We discuss selected examples of potential use cases for LLMs as well as its pitfalls based on examples from existing literature.
arXiv Detail & Related papers (2025-09-03T15:15:31Z)
Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era [49.46005489386284]
This tutorial introduces the foundations and latest advances in synthetic data generation.<n> Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice.
arXiv Detail & Related papers (2025-08-27T05:04:07Z)
A Comprehensive Survey of Synthetic Tabular Data Generation [27.112327373017457]
Tabular data is one of the most prevalent and critical data formats across diverse real-world applications. It is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets.
arXiv Detail & Related papers (2025-04-23T08:33:34Z)
A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models. We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs) We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future [12.506811635026907]
The recent surge in research focused on generating synthetic data from large language models (LLMs) This paper delves into advanced technologies that leverage these gigantic LLMs for the generation of task-specific training data.
arXiv Detail & Related papers (2024-03-07T03:38:44Z)
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open source Python library that allows researchers to implement powerful large language models. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations [21.583825474908334]
We study how the performance of models trained on synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.
arXiv Detail & Related papers (2023-10-11T19:51:13Z)
Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Machine Learning for Synthetic Data Generation: A Review [23.073056971997715]
This paper reviews existing studies that employ machine learning models for the purpose of generating synthetic data. The review encompasses various perspectives, starting with the applications of synthetic data generation, spanning computer vision, speech, natural language processing, healthcare, and business domains. The paper also addresses the crucial aspects of privacy and fairness concerns related to synthetic data generation.
arXiv Detail & Related papers (2023-02-08T13:59:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.