Making Large Language Models Better Data Creators
- URL: http://arxiv.org/abs/2310.20111v1
- Date: Tue, 31 Oct 2023 01:08:34 GMT
- Title: Making Large Language Models Better Data Creators
- Authors: Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar
- Abstract summary: Large language models (LLMs) have advanced the state-of-the-art in NLP significantly.
However, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security.
We propose a unified data creation pipeline that requires only a single formatting example.
- Score: 22.0882632635255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large language models (LLMs) have advanced the state-of-the-art in
NLP significantly, deploying them for downstream applications is still
challenging due to cost, responsiveness, control, or concerns around privacy
and security. As such, trainable models are still the preferred option in some
cases. However, these models still require human-labeled data for optimal
performance, which is expensive and time-consuming to obtain. In order to
address this issue, several techniques to reduce human effort involve labeling
or generating data using LLMs. Although these methods are effective for certain
applications, in practice they encounter difficulties in real-world scenarios.
Labeling data requires careful data selection, while generating data
necessitates task-specific prompt engineering. In this paper, we propose a
unified data creation pipeline that requires only a single formatting example,
and which is applicable to a broad range of tasks, including traditionally
problematic ones with semantically devoid label spaces. In our experiments we
demonstrate that instruction-following LLMs are highly cost-effective data
creators, and that models trained with these data exhibit performance better
than those trained with human-labeled data (by up to 17.5%) on
out-of-distribution evaluation, while maintaining comparable performance on
in-distribution tasks. These results have important implications for the
robustness of NLP systems deployed in the real world.
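
The abstract gives no implementation details, so the following is only a minimal sketch of what "a single formatting example" could look like in practice. The helper names (`build_creation_prompt`, `parse_created_examples`), the prompt wording, the JSON layout, and the stand-in model output are all assumptions made for illustration; none of them come from the paper.

```python
import json


def build_creation_prompt(task_instruction, format_example, num_examples=5):
    """Assemble a data-creation prompt from one formatting example (hypothetical helper)."""
    example_json = json.dumps(format_example, indent=2)
    return (
        f"{task_instruction}\n\n"
        "Follow the format of this example exactly:\n"
        f"{example_json}\n\n"
        f"Generate {num_examples} new, diverse examples as a JSON list."
    )


def parse_created_examples(llm_output, required_keys):
    """Keep only generated items whose keys match the formatting example."""
    try:
        items = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [
        item for item in items
        if isinstance(item, dict) and set(item) == set(required_keys)
    ]


# One formatting example; the label space ("A"/"B") carries no meaning on its own.
format_example = {
    "question": "Which is heavier?",
    "choices": ["A feather", "A brick"],
    "answer": "B",
}
prompt = build_creation_prompt(
    "Create multiple-choice questions.", format_example, num_examples=3
)

# Stand-in for a real LLM response (assumed; no model is called here).
fake_output = json.dumps([
    {"question": "Which is colder?", "choices": ["Ice", "Steam"], "answer": "A"},
    {"question": "Malformed item"},
])
created = parse_created_examples(fake_output, format_example.keys())
```

In this sketch the formatting example serves two purposes: it shows the model the expected output shape, and its keys act as a filter that discards malformed generations before they reach a training set.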
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- UniDM: A Unified Framework for Data Manipulation with Large Language Models [66.61466011795798]
Large Language Models (LLMs) resolve multiple data manipulation tasks.
LLMs offer clear performance benefits but still require customized designs to fit each specific task.
We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks.
arXiv Detail & Related papers (2024-05-10T14:44:04Z)
- Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation [9.497148303350697]
We present a case study that extends the application of large language models (LLMs) for data annotation to enhance the quality of existing datasets.
Specifically, we leverage approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset.
arXiv Detail & Related papers (2024-04-15T11:36:10Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained on LLM-generated labels can outperform the teacher with only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Data Augmentation for Neural NLP [0.0]
Data augmentation is a low-cost approach for tackling data scarcity.
This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing.
arXiv Detail & Related papers (2023-02-22T14:47:15Z)
- Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use machine learning to build models on user data.
Mandates to remove user data require effort in terms of both data handling and model retraining.
Continuous removal of data and model retraining steps do not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z)