ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback
- URL: http://arxiv.org/abs/2210.12329v1
- Date: Sat, 22 Oct 2022 02:07:10 GMT
- Title: ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback
- Authors: Jiacheng Ye, Jiahui Gao, Jiangtao Feng, Zhiyong Wu, Tao Yu, Lingpeng Kong
- Abstract summary: We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data.
We show that ProGen achieves on-par or superior performance with only 1% of the synthetic dataset size.
- Score: 21.168991554983815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, dataset-generation-based zero-shot learning has shown promising
results by training a task-specific model with a dataset synthesized from large
pre-trained language models (PLMs). The final task-specific model often
achieves comparable or even better performance than PLMs under the zero-shot
setting, with orders of magnitude fewer parameters. However, synthetic datasets
have drawbacks: they have long suffered from quality issues such as low
informativeness and redundancy, which explains why massive amounts of synthetic
data do not lead to better performance the way additional human-labeled data
would. To improve the quality of dataset synthesis,
we propose a progressive zero-shot dataset generation framework, ProGen, which
leverages the feedback from the task-specific model to guide the generation of
new training data via in-context examples. Extensive experiments on five text
classification datasets demonstrate the effectiveness of the proposed approach.
We also show that ProGen achieves on-par or superior performance with only 1%
of the synthetic dataset size used by baseline methods without in-context
feedback.
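The loop the abstract describes can be made concrete. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `plm_generate` stands in for sampling labeled examples from a PLM, `train_task_model` stands in for training the small task model and deriving per-example quality feedback, and the round counts and feedback size are made up.

```python
"""Minimal sketch of a ProGen-style generate/train/feedback loop."""
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)

def plm_generate(in_context: List[Example], label: int, n: int) -> List[Example]:
    # Stand-in for sampling from a PLM conditioned on in-context examples;
    # a real system would build a prompt from `in_context` and decode text.
    return [(f"synthetic text {random.randrange(10**6)} (label {label})", label)
            for _ in range(n)]

def train_task_model(data: List[Example]) -> Callable[[str], float]:
    # Stand-in for training the small task-specific model and returning a
    # per-example quality/informativeness score derived from it.
    return lambda text: random.random()

def progen(labels: List[int], rounds: int = 3, per_round: int = 100,
           k_feedback: int = 8) -> List[Example]:
    dataset: List[Example] = []
    in_context: List[Example] = []           # most informative examples so far
    for _ in range(rounds):
        for y in labels:
            dataset += plm_generate(in_context, y, per_round)
        score = train_task_model(dataset)    # feedback from the task model
        # Feed the highest-scoring examples back as in-context demonstrations.
        in_context = sorted(dataset, key=lambda ex: score(ex[0]),
                            reverse=True)[:k_feedback]
    return dataset

if __name__ == "__main__":
    data = progen(labels=[0, 1])
    print(len(data), "synthetic examples; first:", data[0])
```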
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
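As a hedged illustration of the failure mode this summary points to (not the paper's method, whose three improvements are not detailed here), the snippet below measures feature-class correlation in real versus synthetic tabular data; all arrays are toy stand-ins.

```python
# Diagnostic: does synthetic tabular data preserve the real data's
# feature-class correlation? Here the synthetic data deliberately loses it.
import numpy as np

def feature_class_corr(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Pearson correlation of each feature column with the class label.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).mean(axis=0) / (X.std(axis=0) * y.std() + 1e-12)

rng = np.random.default_rng(0)
y_real = rng.integers(0, 2, size=1000)
X_real = y_real[:, None] + rng.normal(size=(1000, 3))   # correlated with label
X_synth = rng.normal(size=(1000, 3))                    # correlation lost
y_synth = rng.integers(0, 2, size=1000)

print("real :", feature_class_corr(X_real, y_real).round(2))
print("synth:", feature_class_corr(X_synth, y_synth).round(2))
```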
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, only a very small amount of seed data is available for data augmentation.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
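A rough sketch of the retrieval idea, using TF-IDF similarity as a stand-in for whatever retriever the paper actually uses; the seed set, external pool, and threshold below are all illustrative assumptions.

```python
# Retrieve examples from an external pool that are similar to the seeds,
# widening the augmentation set beyond the limited seed data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

seeds = ["the movie was wonderful", "a dull, lifeless film"]   # tiny seed set
pool = [                                                       # other datasets
    "a truly wonderful movie",
    "stock prices fell sharply today",
    "the film felt dull and lifeless",
    "new gpu architecture announced",
]

vec = TfidfVectorizer().fit(seeds + pool)
sims = cosine_similarity(vec.transform(seeds), vec.transform(pool))
# Keep pool examples sufficiently similar to at least one seed example.
augmented = [p for j, p in enumerate(pool) if sims[:, j].max() > 0.2]
print(augmented)
```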
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by iteratively reducing this gap using the errors the small model makes.
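A compact sketch of that iterative idea, with all components stubbed: `llm_synthesize` stands in for LLM generation (optionally conditioned on error cases) and `train_small_model` for fitting the small model; nothing here is the paper's actual implementation.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]

def llm_synthesize(error_cases: List[Example], n: int) -> List[Example]:
    # Stand-in for LLM synthesis; a real system would prompt the LLM to
    # produce data resembling `error_cases` when they are provided.
    return [(f"text-{random.randrange(10**6)}", random.randrange(2))
            for _ in range(n)]

def train_small_model(data: List[Example]) -> Callable[[str], int]:
    return lambda text: random.randrange(2)   # toy classifier stand-in

def s3(small_real_set: List[Example], rounds: int = 3) -> List[Example]:
    data = llm_synthesize([], n=200)               # seed synthesis
    for _ in range(rounds):
        model = train_small_model(data)
        errors = [ex for ex in small_real_set if model(ex[0]) != ex[1]]
        data += llm_synthesize(errors, n=100)      # extrapolate from errors
    return data

print(len(s3([("a real example", 1), ("another one", 0)])), "examples")
```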
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Feedback-guided Data Synthesis for Imbalanced Classification [10.836265321046561]
We introduce a framework for augmenting static datasets with useful synthetic samples.
We find that the samples must lie close to the support of the real data for the task at hand and be sufficiently diverse.
On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes.
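The two criteria in this summary, closeness to the real support and sufficient diversity, can be sketched as a filter over candidate samples; the feature arrays and thresholds below are toy assumptions, not the paper's procedure.

```python
# Keep a synthetic sample only if it is near the real data's support and
# not redundant with samples already kept.
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0, 1, size=(200, 8))            # real features (stand-in)
candidates = rng.normal(0, 2, size=(500, 8))      # synthetic candidates

kept: list[np.ndarray] = []
for c in candidates:
    near_support = np.min(np.linalg.norm(real - c, axis=1)) < 3.0
    diverse = all(np.linalg.norm(c - k) > 1.0 for k in kept)
    if near_support and diverse:
        kept.append(c)
print(f"kept {len(kept)} / {len(candidates)} candidates")
```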
arXiv Detail & Related papers (2023-09-29T21:47:57Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
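A numeric toy version of the ensemble idea: fit several generative models on bootstraps of the real data and read the spread across members as uncertainty over the generative process. The paper uses deep generative models; the trivially simple Gaussian fit below is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
real = rng.normal(loc=1.5, scale=1.0, size=300)

estimates = []
for _ in range(10):                      # 10 ensemble members
    boot = rng.choice(real, size=real.size, replace=True)
    mu, sigma = boot.mean(), boot.std()  # toy "generative model" fit
    synth = rng.normal(mu, sigma, size=300)
    estimates.append(synth.mean())       # downstream statistic per member

print(f"ensemble mean {np.mean(estimates):.3f} +/- {np.std(estimates):.3f}")
```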
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- ZeroGen+: Self-Guided High-Quality Data Generation in Efficient Zero-Shot Learning [97.2907428983142]
ZeroGen attempts to use a PLM purely to generate data and train a tiny model without relying on task-specific annotation.
We propose a noise-robust bi-level re-weighting framework that learns per-sample weights measuring data quality without requiring any gold data.
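The bi-level scheme itself is more involved, but its effect can be sketched: each synthetic sample carries a learned weight, and samples the current model finds implausible (high loss) are down-weighted. The simple exponential re-weighting below is an assumption for illustration, not the paper's update rule.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(float)
y[:80] = 1 - y[:80]                      # inject label noise (low quality)
w = np.ones(len(y)) / len(y)             # per-sample weights
theta = np.zeros(5)

for _ in range(200):
    p = 1 / (1 + np.exp(-X @ theta))
    grad = X.T @ (w * (p - y))           # weighted logistic-regression grad
    theta -= 0.5 * grad
    loss = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    w = np.exp(-loss); w /= w.sum()      # down-weight high-loss samples

print("relative weight on noisy samples:", w[:80].mean() / w.mean())
```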
arXiv Detail & Related papers (2022-05-25T11:38:48Z)
- ZeroGen: Efficient Zero-shot Learning via Dataset Generation [28.454620513642034]
We study a flexible and efficient zero-shot learning method, ZeroGen.
Given a zero-shot task, we first generate a dataset from scratch using PLMs in an unsupervised manner.
Experiments and analysis on different NLP tasks, namely, text classification, question answering, and natural language inference, show the effectiveness of ZeroGen.
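ZeroGen's unsupervised generation step can be pictured as label-conditioned prompting; the template wording below is illustrative, not the paper's exact prompts.

```python
# Label-conditioned prompts for unsupervised (x, y) synthesis: sampling a
# continuation of each prompt from a PLM yields an input x whose label y
# is known by construction.
TEMPLATES = {
    "positive": "Write a positive movie review:\n",
    "negative": "Write a negative movie review:\n",
}

def build_prompts(per_label: int) -> list[tuple[str, str]]:
    """Return (prompt, label) pairs to send to a PLM's sampler."""
    return [(template, label)
            for label, template in TEMPLATES.items()
            for _ in range(per_label)]

for prompt, label in build_prompts(per_label=2):
    print(repr(prompt), "->", label)
```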
arXiv Detail & Related papers (2022-02-16T08:18:02Z)
- Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)