ZeroGen$^+$: Self-Guided High-Quality Data Generation in Efficient
Zero-Shot Learning
- URL: http://arxiv.org/abs/2205.12679v1
- Date: Wed, 25 May 2022 11:38:48 GMT
- Title: ZeroGen$^+$: Self-Guided High-Quality Data Generation in Efficient
Zero-Shot Learning
- Authors: Jiahui Gao, Renjie Pi, Yong Lin, Hang Xu, Jiacheng Ye, Zhiyong Wu,
Xiaodan Liang, Zhenguo Li, Lingpeng Kong
- Abstract summary: ZeroGen uses a PLM alone to generate data and train a tiny model without relying on any task-specific annotation.
We propose a noise-robust bi-level re-weighting framework that learns per-sample weights measuring data quality without requiring any gold data.
- Score: 97.2907428983142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to the superior capacity of large pre-trained language
models (PLMs), PLM-based zero-shot learning has shown promising performance
on various natural language processing tasks, and there is growing interest
in further exploring the zero-shot potential of PLMs. Among these efforts,
ZeroGen uses a PLM alone to generate data and train a tiny task model without
relying on any task-specific annotation. Despite its remarkable results, we
observe that the data synthesized by the PLM contains a significant portion
of low-quality samples; overfitting to such data greatly hampers the
performance of the trained model and makes it unreliable for deployment.
Since no gold data is accessible in the zero-shot scenario, it is hard to
perform model or data selection to prevent overfitting to the low-quality
data. To address this problem, we propose a noise-robust bi-level
re-weighting framework that learns per-sample weights measuring data quality
without requiring any gold data. With the learnt weights, clean subsets of
different sizes can then be sampled to train the task model. We verify
theoretically and empirically that our method constructs synthetic datasets
of good quality. It yields a 7.1% relative improvement over ZeroGen in
average accuracy across five established text classification tasks.
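The abstract describes the re-weighting framework only at a high level. The following is a minimal sketch of one way per-sample re-weighting via bi-level optimization can be set up in PyTorch, in the spirit of meta-learning-based sample re-weighting; the use of a held-out synthetic split as the outer-level "validation" set, the one-step look-ahead, the helper name `reweight_step`, and all hyperparameters are illustrative assumptions, not the authors' (noise-robust) implementation.

```python
# Minimal sketch of bi-level per-sample re-weighting over synthetic data
# (PyTorch >= 2.0 for torch.func.functional_call). ASSUMPTIONS: the outer
# "validation" split is itself synthetic (no gold data is available), and a
# one-step look-ahead updates the weight logits, in the spirit of
# meta-learning-based sample re-weighting. This is not the authors' code.
import torch
import torch.nn.functional as F

def reweight_step(model, x_syn, y_syn, x_val, y_val, log_w,
                  lr_inner=0.1, lr_w=1e-2):
    """One bi-level update: a virtual SGD step on the weighted synthetic loss
    (inner level), then a gradient step on the sample-weight logits `log_w`
    using the loss of the virtually updated model (outer level)."""
    w = torch.softmax(log_w, dim=0)  # per-sample weights, positive, sum to 1

    # Inner level: virtual update of the task model under the current weights.
    names, params = zip(*model.named_parameters())
    losses = F.cross_entropy(model(x_syn), y_syn, reduction="none")
    inner_loss = (w * losses).sum()
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    fast_params = {n: p - lr_inner * g for n, p, g in zip(names, params, grads)}

    # Outer level: evaluate the virtually updated model on the noisy
    # validation split and back-propagate through the inner step into `log_w`.
    val_logits = torch.func.functional_call(model, fast_params, (x_val,))
    val_loss = F.cross_entropy(val_logits, y_val)
    (grad_w,) = torch.autograd.grad(val_loss, log_w)
    with torch.no_grad():
        log_w -= lr_w * grad_w  # higher weight => sample looks cleaner
    return val_loss.item()

# Usage sketch: log_w = torch.zeros(num_synthetic, requires_grad=True), then
# alternate reweight_step over mini-batches (indexing log_w by sample index)
# with ordinary training steps of the task model on the weighted loss.
```

Under this reading, the abstract's "clean subsets of different sizes" would correspond to keeping the top-k highest-weighted synthetic samples and retraining the task model on them.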
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Selective In-Context Data Augmentation for Intent Detection using
Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z) - Tuning Language Models as Training Data Generators for
Augmentation-Enhanced Few-Shot Learning [30.65315081964461]
We study few-shot learning with pretrained language models (PLMs) from a different perspective.
We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples.
Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods.
arXiv Detail & Related papers (2022-11-06T06:46:47Z) - ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data.
We show that ProGen achieves on-par or superior performance with only 1% of the synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z) - ZeroGen: Efficient Zero-shot Learning via Dataset Generation [28.454620513642034]
We study a flexible and efficient zero-shot learning method, ZeroGen.
Given a zero-shot task, we first generate a dataset from scratch using PLMs in an unsupervised manner (a minimal sketch of this generation step appears after this related-papers list).
Experiments and analysis on different NLP tasks, namely, text classification, question answering, and natural language inference, show the effectiveness of ZeroGen.
arXiv Detail & Related papers (2022-02-16T08:18:02Z) - Towards Zero-Label Language Learning [20.28186484098947]
This paper explores zero-label learning in Natural Language Processing (NLP).
No human-annotated data is used anywhere during training and models are trained purely on synthetic data.
Inspired by the recent success of few-shot inference on GPT-3, we present a training data creation procedure named Unsupervised Data Generation.
arXiv Detail & Related papers (2021-09-19T19:00:07Z)
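Since ZeroGen is the starting point for ZeroGen$^+$, the following is a minimal sketch of the kind of unsupervised, prompt-conditioned dataset generation it describes, here for binary sentiment classification with GPT-2 via Hugging Face transformers. The prompt wording, model choice, decoding settings, and the helper `generate_synthetic_dataset` are illustrative assumptions, not the papers' exact setup.

```python
# Minimal sketch of ZeroGen-style unsupervised dataset generation for binary
# sentiment classification. ASSUMPTIONS: GPT-2 via Hugging Face transformers
# and illustrative prompt wording; the original papers' prompts, models, and
# decoding settings may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Label-descriptive prompts (hypothetical wording).
PROMPTS = {
    "positive": 'The movie review in positive sentiment is: "',
    "negative": 'The movie review in negative sentiment is: "',
}

def generate_synthetic_dataset(n_per_label=4, max_new_tokens=40):
    """Condition the PLM on a label-descriptive prompt and treat each sampled
    continuation as a pseudo-labelled training example (x, y)."""
    dataset = []
    for label, prompt in PROMPTS.items():
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,                 # sampling for diverse examples
            top_p=0.9,
            max_new_tokens=max_new_tokens,
            num_return_sequences=n_per_label,
            pad_token_id=tokenizer.eos_token_id,
        )
        for seq in outputs:
            text = tokenizer.decode(seq, skip_special_tokens=True)
            continuation = text[len(prompt):].split('"')[0].strip()
            if continuation:
                dataset.append((continuation, label))
    return dataset

if __name__ == "__main__":
    for x, y in generate_synthetic_dataset():
        print(f"[{y}] {x}")
```

ZeroGen$^+$'s contribution concerns what happens next: a procedure like this inevitably produces noisy continuations, and the re-weighting sketch above would assign them low weights so that only higher-quality samples are kept for training the tiny task model.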