Large Language Model as Attributed Training Data Generator: A Tale of
Diversity and Bias
- URL: http://arxiv.org/abs/2306.15895v2
- Date: Wed, 18 Oct 2023 02:07:12 GMT
- Title: Large Language Model as Attributed Training Data Generator: A Tale of
Diversity and Bias
- Authors: Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay
Krishna, Jiaming Shen, Chao Zhang
- Abstract summary: Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
- Score: 92.41919689753051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have been recently leveraged as training data
generators for various natural language processing (NLP) tasks. While previous
research has explored different approaches to training models using generated
data, they generally rely on simple class-conditional prompts, which may limit
the diversity of the generated data and inherit systematic biases of LLM. Thus,
we investigate training data generation with diversely attributed prompts
(e.g., specifying attributes like length and style), which have the potential
to yield diverse and attributed generated data. Our investigation focuses on
datasets with high cardinality and diverse domains, wherein we demonstrate that
attributed prompts outperform simple class-conditional prompts in terms of the
resulting model's performance. Additionally, we present a comprehensive
empirical study on data generation encompassing vital aspects like bias,
diversity, and efficiency, and highlight three key observations: firstly,
synthetic datasets generated by simple prompts exhibit significant biases, such
as regional bias; secondly, attribute diversity plays a pivotal role in
enhancing model performance; lastly, attributed prompts achieve the performance
of simple class-conditional prompts while utilizing only 5\% of the querying
cost of ChatGPT associated with the latter. The data and code are available on
\url{https://github.com/yueyu1030/AttrPrompt}.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Meta-learning Pathologies from Radiology Reports using Variance Aware
Prototypical Networks [3.464871689508835]
We propose a simple extension of the Prototypical Networks for few-shot text classification.
Our main idea is to replace the class prototypes by Gaussians and introduce a regularization term that encourages the examples to be clustered near the appropriate class centroids.
arXiv Detail & Related papers (2022-10-22T05:22:29Z) - CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.