Generating Datasets with Pretrained Language Models
- URL: http://arxiv.org/abs/2104.07540v2
- Date: Sat, 17 Apr 2021 16:05:21 GMT
- Title: Generating Datasets with Pretrained Language Models
- Authors: Timo Schick and Hinrich Schütze
- Abstract summary: We show how large language models can be leveraged to obtain high-quality embeddings without requiring labeled data, finetuning or modifications to the pretraining objective.
We utilize the generative abilities of PLMs to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models.
- Score: 12.919486518128734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To obtain high-quality sentence embeddings from pretrained language models
(PLMs), they must either be augmented with additional pretraining objectives or
finetuned on a large set of labeled text pairs. While the latter approach
typically outperforms the former, it requires great human effort to generate
suitable datasets of sufficient size. In this paper, we show how large PLMs can
be leveraged to obtain high-quality embeddings without requiring any labeled
data, finetuning or modifications to the pretraining objective: We utilize the
generative abilities of PLMs to generate entire datasets of labeled text pairs
from scratch, which can then be used for regular finetuning of much smaller
models. Our fully unsupervised approach outperforms strong baselines on several
English semantic textual similarity datasets.
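The abstract describes the generate-then-finetune recipe only at a high level. The sketch below illustrates that idea under stated assumptions: instruction-style prompts conditioned on a similarity level, a `generate` callable wrapping a large PLM, and continuous similarity labels. None of these details are taken verbatim from the paper.

```python
# Minimal sketch of the generate-then-finetune idea: a large PLM writes labeled
# sentence pairs from instruction-style prompts, and the resulting dataset is
# then used to finetune a much smaller embedding model. Prompt wording, labels,
# and the `generate` callable are illustrative assumptions.
from typing import Callable, List, Tuple

# Hypothetical label-conditioned instructions, one per similarity level.
PROMPTS = {
    1.0: 'Write two sentences that mean exactly the same thing.\nSentence 1: "{s1}"\nSentence 2: "',
    0.5: 'Write two sentences that are on the same topic but mean different things.\nSentence 1: "{s1}"\nSentence 2: "',
    0.0: 'Write two completely unrelated sentences.\nSentence 1: "{s1}"\nSentence 2: "',
}

def generate_pairs(
    generate: Callable[[str], str],       # hypothetical wrapper around a large PLM
    seed_sentences: List[str],
    pairs_per_label: int = 1,
) -> List[Tuple[str, str, float]]:
    """Produce (sentence1, sentence2, similarity label) triples from scratch."""
    dataset = []
    for s1 in seed_sentences:
        for label, template in PROMPTS.items():
            for _ in range(pairs_per_label):
                completion = generate(template.format(s1=s1))
                s2 = completion.split('"')[0].strip()  # cut at the closing quote
                if s2:
                    dataset.append((s1, s2, label))
    return dataset

# The resulting triples can then be used as an ordinary labeled dataset to
# finetune a much smaller bi-encoder (e.g. a Sentence-BERT-style model).
```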
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
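The summary above does not spell out the paper's three improvements; the sketch below only illustrates the generic recipe of LLM-based tabular generation it builds on. The serialization format, prompt, and `generate` callable are assumptions for illustration.

```python
# Illustrative sketch of LLM-based tabular data generation (not the paper's
# specific method): generation is conditioned on the target class so that
# feature-class correlations can be expressed, and completions in a simple
# "feature is value" format are parsed back into rows.
from typing import Callable, Dict, List

def sample_rows(
    generate: Callable[[str], str],       # hypothetical wrapper around an LLM
    feature_names: List[str],
    target_class: str,
    n_samples: int,
) -> List[Dict[str, str]]:
    """Ask the LLM for synthetic rows of a given class and parse them back."""
    prompt = (
        f"Generate one row with features {', '.join(feature_names)} "
        f"for target class '{target_class}', written as 'feature is value, ...':\n"
    )
    rows = []
    for _ in range(n_samples):
        completion = generate(prompt)
        row = {}
        for part in completion.split(","):
            if " is " in part:
                key, value = part.split(" is ", 1)
                row[key.strip()] = value.strip()
        if set(row) == set(feature_names):   # keep only well-formed rows
            row["target"] = target_class
            rows.append(row)
    return rows
```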
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
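A hedged sketch of the synthetic-data step described above: for each task description, an LLM is asked to emit a JSON object with a query, a relevant passage, and a hard negative. The prompt text and JSON schema here are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of LLM-generated training data for text embeddings: per task
# description, request a JSON record with a query, a positive passage, and a
# hard negative, then keep only well-formed completions.
import json
from typing import Callable, Dict, List

def synthesize_embedding_examples(
    generate: Callable[[str], str],       # hypothetical LLM wrapper
    task_descriptions: List[str],
    language: str = "English",
    per_task: int = 2,
) -> List[Dict[str, str]]:
    examples = []
    for task in task_descriptions:
        prompt = (
            f"You are creating training data for the retrieval task: {task}\n"
            f"Write the output in {language} as JSON with keys "
            '"query", "positive", "hard_negative".'
        )
        for _ in range(per_task):
            try:
                record = json.loads(generate(prompt))
            except json.JSONDecodeError:
                continue                  # skip malformed completions
            if {"query", "positive", "hard_negative"} <= record.keys():
                examples.append(record)
    return examples
```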
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
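A minimal sketch of the ensembling idea: several constituent detectors each output a probability that a text is machine-generated, and the final decision averages them. Any weighting or meta-classifier the paper may use is omitted; the detector callables are assumptions.

```python
# Ensemble detection sketch: average the per-detector probabilities that a text
# is machine-generated and compare against a threshold.
from statistics import mean
from typing import Callable, List

def ensemble_detect(
    detectors: List[Callable[[str], float]],  # each returns P(machine-generated)
    text: str,
    threshold: float = 0.5,
) -> bool:
    """Flag `text` as AI-generated if the mean detector score exceeds the threshold."""
    scores = [detector(text) for detector in detectors]
    return mean(scores) > threshold
```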
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
- Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models [15.991777903345575]
Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study generating synthetic fine-tuning data with fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
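A sketch of the teacher-student recipe, under assumptions: a (separately fine-tuned) teacher LLM generates labeled examples per class, and the synthetic set is handed to a much smaller student for supervised fine-tuning. The prompt format and both model callables are illustrative, not the paper's exact setup.

```python
# Teacher-student synthetic data sketch: the fine-tuned teacher writes labeled
# texts for each class; the pairs are then used to fine-tune a small student.
from typing import Callable, List, Tuple

def generate_student_data(
    teacher_generate: Callable[[str], str],   # hypothetical fine-tuned teacher LLM
    labels: List[str],
    per_label: int = 100,
) -> List[Tuple[str, str]]:
    data = []
    for label in labels:
        prompt = f"Write a short text that a classifier should label as '{label}':\n"
        for _ in range(per_label):
            text = teacher_generate(prompt).strip()
            if text:
                data.append((text, label))
    return data

# The (text, label) pairs would then feed ordinary supervised fine-tuning of
# the smaller student model.
```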
arXiv Detail & Related papers (2023-10-02T11:49:05Z)
- ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines that use large NLG models.
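A hedged sketch of retrieval-based training-data creation: class descriptions act as queries, the nearest documents from an unlabeled corpus are retrieved by embedding similarity, and each retrieved document is pseudo-labeled with its query's class. ReGen's progressive, multi-round retrieval and filtering are not reproduced here; `embed` is an assumed dense encoder.

```python
# Retrieval-enhanced data creation sketch: pseudo-label the top-k nearest
# unlabeled documents to each verbalized class description.
from math import sqrt
from typing import Callable, Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)) + 1e-12)

def retrieve_training_data(
    embed: Callable[[str], List[float]],        # assumed dense encoder
    class_descriptions: Dict[str, str],         # label -> verbalized description
    corpus: List[str],
    top_k: int = 100,
) -> List[Tuple[str, str]]:
    doc_vecs = [embed(doc) for doc in corpus]
    data = []
    for label, description in class_descriptions.items():
        query_vec = embed(description)
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(query_vec, doc_vecs[i]),
            reverse=True,
        )
        data.extend((corpus[i], label) for i in ranked[:top_k])
    return data
```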
arXiv Detail & Related papers (2023-05-18T04:30:09Z)
- Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM [31.25193238045053]
We introduce a novel method, namely GenCo, which leverages the strong generative power of large language models to assist in training a smaller language model.
In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways.
It helps craft additional high-quality training pairs by rewriting input texts conditioned on predicted labels.
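A sketch of that label-conditioned rewriting step: the small model's predicted label prompts an instruction-following LLM to rewrite the input, producing an extra (text, rewritten text, label) training pair for the self-training loop. Prompt wording and both callables are assumptions, not GenCo's exact implementation.

```python
# Label-conditioned rewriting sketch: augment self-training data by having an
# LLM rewrite each unlabeled text so it clearly expresses its predicted label.
from typing import Callable, List, Tuple

def augment_with_rewrites(
    predict_label: Callable[[str], str],    # small classifier being self-trained
    llm_generate: Callable[[str], str],     # instruction-following LLM
    unlabeled_texts: List[str],
) -> List[Tuple[str, str, str]]:
    pairs = []
    for text in unlabeled_texts:
        label = predict_label(text)
        prompt = (
            f"Rewrite the following text so that it clearly expresses the "
            f"category '{label}', keeping its meaning:\n{text}\nRewritten:"
        )
        rewritten = llm_generate(prompt).strip()
        if rewritten:
            pairs.append((text, rewritten, label))
    return pairs
```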
arXiv Detail & Related papers (2023-04-24T07:35:38Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
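A minimal sketch of a two-step, explain-then-annotate prompt flow: the LLM is first asked to explain why a demonstration example carries its gold label, and that explanation is then placed in the annotation prompt for new examples. Prompt wording and the `generate` callable are assumptions for illustration.

```python
# Explain-then-annotate sketch: step 1 elicits an explanation for a labeled
# demonstration; step 2 annotates a new example conditioned on that explanation.
from typing import Callable

def explain_then_annotate(
    generate: Callable[[str], str],      # hypothetical LLM wrapper
    demo_text: str,
    demo_label: str,
    new_text: str,
) -> str:
    # Step 1: elicit an explanation for a labeled demonstration.
    explanation = generate(
        f"Text: {demo_text}\nLabel: {demo_label}\n"
        "Explain briefly why this label is correct:"
    )
    # Step 2: annotate a new example, conditioning on that explanation.
    return generate(
        f"Example: {demo_text}\nLabel: {demo_label}\nReason: {explanation}\n\n"
        f"Now label the following text in the same way.\nText: {new_text}\nLabel:"
    ).strip()
```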
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
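A hedged sketch of curriculum-ordered self-training: unlabeled inputs are sorted by a difficulty proxy and pseudo-labeled in stages, easiest first, with the model retrained after each stage. The difficulty proxy and the training and pseudo-labeling callables are assumptions, not CBST's exact criteria.

```python
# Curriculum self-training sketch: pseudo-label increasingly difficult batches
# and refit the model on all pseudo-labeled data accumulated so far.
from typing import Callable, List, Tuple

def curriculum_self_train(
    difficulty: Callable[[str], float],             # proxy, e.g. input length or model loss
    pseudo_label: Callable[[str], str],             # current model generates the text
    retrain: Callable[[List[Tuple[str, str]]], None],
    unlabeled: List[str],
    n_stages: int = 3,
) -> None:
    ordered = sorted(unlabeled, key=difficulty)     # easiest examples first
    stage_size = max(1, len(ordered) // n_stages)
    labeled: List[Tuple[str, str]] = []
    for stage in range(n_stages):
        start = stage * stage_size
        end = len(ordered) if stage == n_stages - 1 else start + stage_size
        labeled.extend((x, pseudo_label(x)) for x in ordered[start:end])
        retrain(labeled)                            # refit on all pseudo-labels so far
```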
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
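The summary is high-level, so the sketch below only illustrates one plausible reading of a self-imitation augmentation phase: the current model generates outputs for training inputs, generations scoring above a quality threshold are kept, and the retained pairs are mixed back into the MLE training data. The scoring function and callables are assumptions, not the paper's procedure.

```python
# Illustrative self-imitation augmentation phase (an assumption, not the
# paper's exact method): keep the model's own high-scoring generations as
# additional training pairs for the next round of MLE training.
from typing import Callable, List, Tuple

def self_imitation_augment(
    generate: Callable[[str], str],          # current text generation model
    score: Callable[[str, str], float],      # quality score for (input, output)
    train_pairs: List[Tuple[str, str]],
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    augmented = list(train_pairs)
    for source, _reference in train_pairs:
        candidate = generate(source)
        if score(source, candidate) >= threshold:
            augmented.append((source, candidate))  # imitate the model's own good outputs
    return augmented
```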
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
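The summary names Dynamic Blocking without describing it, so the sketch below is only a simplified interpretation of a source-token blocking rule during decoding, stated as an assumption rather than the paper's exact algorithm: when the last generated token also occurs in the source, the source token that immediately follows it is excluded at the next step, nudging the decoder away from copying the source verbatim.

```python
# Simplified source-token blocking sketch (an interpretation, not necessarily
# the paper's exact Dynamic Blocking algorithm).
from typing import Callable, Dict, List, Set

def build_block_map(source_tokens: List[str]) -> Dict[str, Set[str]]:
    """Map each source token to the token(s) that directly follow it."""
    block_map: Dict[str, Set[str]] = {}
    for tok, nxt in zip(source_tokens, source_tokens[1:]):
        block_map.setdefault(tok, set()).add(nxt)
    return block_map

def decode_with_blocking(
    next_token: Callable[[List[str], Set[str]], str],  # model step given banned tokens
    source_tokens: List[str],
    max_len: int = 30,
) -> List[str]:
    block_map = build_block_map(source_tokens)
    output: List[str] = []
    for _ in range(max_len):
        banned = block_map.get(output[-1], set()) if output else set()
        tok = next_token(output, banned)
        if tok == "</s>":                    # assumed end-of-sequence marker
            break
        output.append(tok)
    return output
```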
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.