TarGEN: Targeted Data Generation with Large Language Models
- URL: http://arxiv.org/abs/2310.17876v2
- Date: Mon, 30 Oct 2023 19:08:56 GMT
- Title: TarGEN: Targeted Data Generation with Large Language Models
- Authors: Himanshu Gupta and Kevin Scaria and Ujjwala Anantheswaran and Shreyas
Verma and Mihir Parmar and Saurabh Arjun Sawant and Chitta Baral and Swaroop
Mishra
- Abstract summary: TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
- Score: 54.1093098278564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has sparked interest in
data synthesis techniques, aiming to generate diverse and high-quality
synthetic datasets. However, these synthetic datasets often suffer from a lack
of diversity and added noise. In this paper, we present TarGEN, a multi-step
prompting strategy for generating high-quality synthetic datasets utilizing a
LLM. An advantage of TarGEN is its seedless nature; it does not require
specific task instances, broadening its applicability beyond task replication.
We augment TarGEN with a method known as self-correction empowering LLMs to
rectify inaccurately labeled instances during dataset creation, ensuring
reliable labels. To assess our technique's effectiveness, we emulate 8 tasks
from the SuperGLUE benchmark and finetune various language models, including
encoder-only, encoder-decoder, and decoder-only models on both synthetic and
original training sets. Evaluation on the original test set reveals that models
trained on datasets generated by TarGEN perform approximately 1-2% points
better than those trained on original datasets (82.84% via syn. vs. 81.12% on
og. using Flan-T5). When incorporating instruction tuning, the performance
increases to 84.54% on synthetic data vs. 81.49% on original data by Flan-T5. A
comprehensive analysis of the synthetic dataset compared to the original
dataset reveals that the synthetic dataset demonstrates similar or higher
levels of dataset complexity and diversity. Furthermore, the synthetic dataset
displays a bias level that aligns closely with the original dataset. Finally,
when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive
results on the OpenLLM leaderboard, surpassing the model trained on the
Self-Instruct dataset by 4.14% points. We hope that TarGEN can be helpful for
quality data generation and reducing the human efforts to create complex
benchmarks.
Related papers
- SUMIE: A Synthetic Benchmark for Incremental Entity Summarization [6.149024468471498]
No existing dataset adequately tests how well language models can incrementally update entity summaries.
We introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges.
This dataset effectively highlights problems like incorrect entity association and incomplete information presentation.
arXiv Detail & Related papers (2024-06-07T16:49:21Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via
Post-Processing Techniques [7.967995669387532]
generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and
the Case of Information Extraction [28.51694365908817]
This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by large language models.
We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation.
arXiv Detail & Related papers (2023-03-07T18:48:55Z) - ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data.
We show ProGen achieves on-par or superior performance with only 1% synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - On the use of automatically generated synthetic image datasets for
benchmarking face recognition [2.0196229393131726]
Recent advances in Generative Adversarial Networks (GANs) provide a pathway to replace real datasets by synthetic datasets.
Recent advances in Generative Adversarial Networks (GANs) to synthesize realistic face images provide a pathway to replace real datasets by synthetic datasets.
benchmarking results on the synthetic dataset are a good substitution, often providing error rates and system ranking similar to the benchmarking on the real dataset.
arXiv Detail & Related papers (2021-06-08T09:54:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.