Generating Datasets with Pretrained Language Models
- URL: http://arxiv.org/abs/2104.07540v2
- Date: Sat, 17 Apr 2021 16:05:21 GMT
- Title: Generating Datasets with Pretrained Language Models
- Authors: Timo Schick and Hinrich Schütze
- Abstract summary: We show how large language models can be leveraged to obtain high-quality embeddings without requiring labeled data, finetuning or modifications to the pretraining objective.
We utilize the generative abilities of PLMs to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models.
- Score: 12.919486518128734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To obtain high-quality sentence embeddings from pretrained language models
(PLMs), they must either be augmented with additional pretraining objectives or
finetuned on a large set of labeled text pairs. While the latter approach
typically outperforms the former, it requires great human effort to generate
suitable datasets of sufficient size. In this paper, we show how large PLMs can
be leveraged to obtain high-quality embeddings without requiring any labeled
data, finetuning or modifications to the pretraining objective: We utilize the
generative abilities of PLMs to generate entire datasets of labeled text pairs
from scratch, which can then be used for regular finetuning of much smaller
models. Our fully unsupervised approach outperforms strong baselines on several
English semantic textual similarity datasets.
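The abstract describes the generate-then-finetune recipe only at a high level. The sketch below illustrates that idea under stated assumptions: instruction-style prompts conditioned on a similarity level, a `generate` callable wrapping a large PLM, and continuous similarity labels. None of these details are taken verbatim from the paper.

```python
# Minimal sketch of the generate-then-finetune idea: a large PLM writes labeled
# sentence pairs from instruction-style prompts, and the resulting dataset is
# then used to finetune a much smaller embedding model. Prompt wording, labels,
# and the `generate` callable are illustrative assumptions.
from typing import Callable, List, Tuple

# Hypothetical label-conditioned instructions, one per similarity level.
PROMPTS = {
    1.0: 'Write two sentences that mean exactly the same thing.\nSentence 1: "{s1}"\nSentence 2: "',
    0.5: 'Write two sentences that are on the same topic but mean different things.\nSentence 1: "{s1}"\nSentence 2: "',
    0.0: 'Write two completely unrelated sentences.\nSentence 1: "{s1}"\nSentence 2: "',
}

def generate_pairs(
    generate: Callable[[str], str],       # hypothetical wrapper around a large PLM
    seed_sentences: List[str],
    pairs_per_label: int = 1,
) -> List[Tuple[str, str, float]]:
    """Produce (sentence1, sentence2, similarity label) triples from scratch."""
    dataset = []
    for s1 in seed_sentences:
        for label, template in PROMPTS.items():
            for _ in range(pairs_per_label):
                completion = generate(template.format(s1=s1))
                s2 = completion.split('"')[0].strip()  # cut at the closing quote
                if s2:
                    dataset.append((s1, s2, label))
    return dataset

# The resulting triples can then be used as an ordinary labeled dataset to
# finetune a much smaller bi-encoder (e.g. a Sentence-BERT-style model).
```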
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
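The summary above does not spell out the paper's three improvements; the sketch below only illustrates the generic recipe of LLM-based tabular generation it builds on. The serialization format, prompt, and `generate` callable are assumptions for illustration.

```python
# Illustrative sketch of LLM-based tabular data generation (not the paper's
# specific method): generation is conditioned on the target class so that
# feature-class correlations can be expressed, and completions in a simple
# "feature is value" format are parsed back into rows.
from typing import Callable, Dict, List

def sample_rows(
    generate: Callable[[str], str],       # hypothetical wrapper around an LLM
    feature_names: List[str],
    target_class: str,
    n_samples: int,
) -> List[Dict[str, str]]:
    """Ask the LLM for synthetic rows of a given class and parse them back."""
    prompt = (
        f"Generate one row with features {', '.join(feature_names)} "
        f"for target class '{target_class}', written as 'feature is value, ...':\n"
    )
    rows = []
    for _ in range(n_samples):
        completion = generate(prompt)
        row = {}
        for part in completion.split(","):
            if " is " in part:
                key, value = part.split(" is ", 1)
                row[key.strip()] = value.strip()
        if set(row) == set(feature_names):   # keep only well-formed rows
            row["target"] = target_class
            rows.append(row)
    return rows
```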
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
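A hedged sketch of the synthetic-data step described above: for each task description, an LLM is asked to emit a JSON object with a query, a relevant passage, and a hard negative. The prompt text and JSON schema here are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of LLM-generated training data for text embeddings: per task
# description, request a JSON record with a query, a positive passage, and a
# hard negative, then keep only well-formed completions.
import json
from typing import Callable, Dict, List

def synthesize_embedding_examples(
    generate: Callable[[str], str],       # hypothetical LLM wrapper
    task_descriptions: List[str],
    language: str = "English",
    per_task: int = 2,
) -> List[Dict[str, str]]:
    examples = []
    for task in task_descriptions:
        prompt = (
            f"You are creating training data for the retrieval task: {task}\n"
            f"Write the output in {language} as JSON with keys "
            '"query", "positive", "hard_negative".'
        )
        for _ in range(per_task):
            try:
                record = json.loads(generate(prompt))
            except json.JSONDecodeError:
                continue                  # skip malformed completions
            if {"query", "positive", "hard_negative"} <= record.keys():
                examples.append(record)
    return examples
```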
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
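A minimal sketch of the ensembling idea: several constituent detectors each output a probability that a text is machine-generated, and the final decision averages them. Any weighting or meta-classifier the paper may use is omitted; the detector callables are assumptions.

```python
# Ensemble detection sketch: average the per-detector probabilities that a text
# is machine-generated and compare against a threshold.
from statistics import mean
from typing import Callable, List

def ensemble_detect(
    detectors: List[Callable[[str], float]],  # each returns P(machine-generated)
    text: str,
    threshold: float = 0.5,
) -> bool:
    """Flag `text` as AI-generated if the mean detector score exceeds the threshold."""
    scores = [detector(text) for detector in detectors]
    return mean(scores) > threshold
```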
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
- Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models [15.991777903345575]
Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study generating synthetic fine-tuning data with fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
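A sketch of the teacher-student recipe, under assumptions: a (separately fine-tuned) teacher LLM generates labeled examples per class, and the synthetic set is handed to a much smaller student for supervised fine-tuning. The prompt format and both model callables are illustrative, not the paper's exact setup.

```python
# Teacher-student synthetic data sketch: the fine-tuned teacher writes labeled
# texts for each class; the pairs are then used to fine-tune a small student.
from typing import Callable, List, Tuple

def generate_student_data(
    teacher_generate: Callable[[str], str],   # hypothetical fine-tuned teacher LLM
    labels: List[str],
    per_label: int = 100,
) -> List[Tuple[str, str]]:
    data = []
    for label in labels:
        prompt = f"Write a short text that a classifier should label as '{label}':\n"
        for _ in range(per_label):
            text = teacher_generate(prompt).strip()
            if text:
                data.append((text, label))
    return data

# The (text, label) pairs would then feed ordinary supervised fine-tuning of
# the smaller student model.
```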
arXiv Detail & Related papers (2023-10-02T11:49:05Z)
- ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that ReGen achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines that use large NLG models.
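A hedged sketch of retrieval-based training-data creation: class descriptions act as queries, the nearest documents from an unlabeled corpus are retrieved by embedding similarity, and each retrieved document is pseudo-labeled with its query's class. ReGen's progressive, multi-round retrieval and filtering are not reproduced here; `embed` is an assumed dense encoder.

```python
# Retrieval-enhanced data creation sketch: pseudo-label the top-k nearest
# unlabeled documents to each verbalized class description.
from math import sqrt
from typing import Callable, Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)) + 1e-12)

def retrieve_training_data(
    embed: Callable[[str], List[float]],        # assumed dense encoder
    class_descriptions: Dict[str, str],         # label -> verbalized description
    corpus: List[str],
    top_k: int = 100,
) -> List[Tuple[str, str]]:
    doc_vecs = [embed(doc) for doc in corpus]
    data = []
    for label, description in class_descriptions.items():
        query_vec = embed(description)
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(query_vec, doc_vecs[i]),
            reverse=True,
        )
        data.extend((corpus[i], label) for i in ranked[:top_k])
    return data
```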
arXiv Detail & Related papers (2023-05-18T04:30:09Z)
- Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM [31.25193238045053]
We introduce a novel method, namely GenCo, which leverages the strong generative power of large language models to assist in training a smaller language model.
In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways.
It helps craft additional high-quality training pairs by rewriting input texts conditioned on predicted labels.
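A sketch of that label-conditioned rewriting step: the small model's predicted label prompts an instruction-following LLM to rewrite the input, producing an extra (text, rewritten text, label) training pair for the self-training loop. Prompt wording and both callables are assumptions, not GenCo's exact implementation.

```python
# Label-conditioned rewriting sketch: augment self-training data by having an
# LLM rewrite each unlabeled text so it clearly expresses its predicted label.
from typing import Callable, List, Tuple

def augment_with_rewrites(
    predict_label: Callable[[str], str],    # small classifier being self-trained
    llm_generate: Callable[[str], str],     # instruction-following LLM
    unlabeled_texts: List[str],
) -> List[Tuple[str, str, str]]:
    pairs = []
    for text in unlabeled_texts:
        label = predict_label(text)
        prompt = (
            f"Rewrite the following text so that it clearly expresses the "
            f"category '{label}', keeping its meaning:\n{text}\nRewritten:"
        )
        rewritten = llm_generate(prompt).strip()
        if rewritten:
            pairs.append((text, rewritten, label))
    return pairs
```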
arXiv Detail & Related papers (2023-04-24T07:35:38Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
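A minimal sketch of a two-step, explain-then-annotate prompt flow: the LLM is first asked to explain why a demonstration example carries its gold label, and that explanation is then placed in the annotation prompt for new examples. Prompt wording and the `generate` callable are assumptions for illustration.

```python
# Explain-then-annotate sketch: step 1 elicits an explanation for a labeled
# demonstration; step 2 annotates a new example conditioned on that explanation.
from typing import Callable

def explain_then_annotate(
    generate: Callable[[str], str],      # hypothetical LLM wrapper
    demo_text: str,
    demo_label: str,
    new_text: str,
) -> str:
    # Step 1: elicit an explanation for a labeled demonstration.
    explanation = generate(
        f"Text: {demo_text}\nLabel: {demo_label}\n"
        "Explain briefly why this label is correct:"
    )
    # Step 2: annotate a new example, conditioning on that explanation.
    return generate(
        f"Example: {demo_text}\nLabel: {demo_label}\nReason: {explanation}\n\n"
        f"Now label the following text in the same way.\nText: {new_text}\nLabel:"
    ).strip()
```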
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
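A hedged sketch of curriculum-ordered self-training: unlabeled inputs are sorted by a difficulty proxy and pseudo-labeled in stages, easiest first, with the model retrained after each stage. The difficulty proxy and the training and pseudo-labeling callables are assumptions, not CBST's exact criteria.

```python
# Curriculum self-training sketch: pseudo-label increasingly difficult batches
# and refit the model on all pseudo-labeled data accumulated so far.
from typing import Callable, List, Tuple

def curriculum_self_train(
    difficulty: Callable[[str], float],             # proxy, e.g. input length or model loss
    pseudo_label: Callable[[str], str],             # current model generates the text
    retrain: Callable[[List[Tuple[str, str]]], None],
    unlabeled: List[str],
    n_stages: int = 3,
) -> None:
    ordered = sorted(unlabeled, key=difficulty)     # easiest examples first
    stage_size = max(1, len(ordered) // n_stages)
    labeled: List[Tuple[str, str]] = []
    for stage in range(n_stages):
        start = stage * stage_size
        end = len(ordered) if stage == n_stages - 1 else start + stage_size
        labeled.extend((x, pseudo_label(x)) for x in ordered[start:end])
        retrain(labeled)                            # refit on all pseudo-labels so far
```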
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
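The summary is high-level, so the sketch below only illustrates one plausible reading of a self-imitation augmentation phase: the current model generates outputs for training inputs, generations scoring above a quality threshold are kept, and the retained pairs are mixed back into the MLE training data. The scoring function and callables are assumptions, not the paper's procedure.

```python
# Illustrative self-imitation augmentation phase (an assumption, not the
# paper's exact method): keep the model's own high-scoring generations as
# additional training pairs for the next round of MLE training.
from typing import Callable, List, Tuple

def self_imitation_augment(
    generate: Callable[[str], str],          # current text generation model
    score: Callable[[str, str], float],      # quality score for (input, output)
    train_pairs: List[Tuple[str, str]],
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    augmented = list(train_pairs)
    for source, _reference in train_pairs:
        candidate = generate(source)
        if score(source, candidate) >= threshold:
            augmented.append((source, candidate))  # imitate the model's own good outputs
    return augmented
```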
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
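The summary names Dynamic Blocking without describing it, so the sketch below is only a simplified interpretation of a source-token blocking rule during decoding, stated as an assumption rather than the paper's exact algorithm: when the last generated token also occurs in the source, the source token that immediately follows it is excluded at the next step, nudging the decoder away from copying the source verbatim.

```python
# Simplified source-token blocking sketch (an interpretation, not necessarily
# the paper's exact Dynamic Blocking algorithm).
from typing import Callable, Dict, List, Set

def build_block_map(source_tokens: List[str]) -> Dict[str, Set[str]]:
    """Map each source token to the token(s) that directly follow it."""
    block_map: Dict[str, Set[str]] = {}
    for tok, nxt in zip(source_tokens, source_tokens[1:]):
        block_map.setdefault(tok, set()).add(nxt)
    return block_map

def decode_with_blocking(
    next_token: Callable[[List[str], Set[str]], str],  # model step given banned tokens
    source_tokens: List[str],
    max_len: int = 30,
) -> List[str]:
    block_map = build_block_map(source_tokens)
    output: List[str] = []
    for _ in range(max_len):
        banned = block_map.get(output[-1], set()) if output else set()
        tok = next_token(output, banned)
        if tok == "</s>":                    # assumed end-of-sequence marker
            break
        output.append(tok)
    return output
```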
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.