Synthetic Data Generation with Large Language Models for Text
Classification: Potential and Limitations
- URL: http://arxiv.org/abs/2310.07849v2
- Date: Fri, 13 Oct 2023 01:31:59 GMT
- Authors: Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin
- Abstract summary: We study how the performance of models trained on synthetic data may vary with the subjectivity of classification.
Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.
- Score: 21.583825474908334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The collection and curation of high-quality training data is crucial for
developing text classification models with superior performance, but it is
often associated with significant costs and time investment. Researchers have
recently explored using large language models (LLMs) to generate synthetic
datasets as an alternative approach. However, the effectiveness of the
LLM-generated synthetic data in supporting model training is inconsistent
across different classification tasks. To better understand factors that
moderate the effectiveness of the LLM-generated synthetic data, in this study,
we look into how the performance of models trained on these synthetic data may
vary with the subjectivity of classification. Our results indicate that
subjectivity, at both the task level and instance level, is negatively
associated with the performance of the model trained on synthetic data. We
conclude by discussing the implications of our work on the potential and
limitations of leveraging LLM for synthetic data generation.
Related papers
- Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification [7.357494019212501]
We propose efficient weighted-loss approaches to align synthetic data with real-world distribution.
We empirically assessed the effectiveness of our method on multiple text classification tasks.
arXiv Detail & Related papers (2024-10-28T20:53:49Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- Data Generation Using Large Language Models for Text Classification: An Empirical Case Study [15.447491854250227]
We use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches.
This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
arXiv Detail & Related papers (2024-06-27T21:41:43Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.