Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
- URL: http://arxiv.org/abs/2305.14327v2
- Date: Thu, 26 Oct 2023 05:10:18 GMT
- Title: Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
- Authors: Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han,
Kai-Wei Chang
- Abstract summary: We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
- Score: 92.2167864437497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning has emerged to enhance the capabilities of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either manually annotate or employ LLM (e.g., GPT-series) to
generate data for instruction tuning. However, they often overlook associating
instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost for generating instructions (e.g., it
costs less than $12 USD by calling GPT-3.5-turbo for generating 800K
instruction tuning samples; 2) it provides high-quality data for instruction
tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform
with comparable data sizes); and 3) it supports the continuous improvement of
models by generating instruction-tuning data when a new annotated dataset
becomes available. We further investigate a continual learning scheme for
learning with the ever-growing instruction-tuning dataset, and demonstrate that
replaying tasks with diverse instruction embeddings not only helps mitigate
forgetting issues but generalizes to unseen tasks better.
Code and data are available at https://github.com/WadeYin9712/Dynosaur.
Related papers
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z) - GenQA: Generating Millions of Instructions from a Handful of Prompts [67.54980063851605]
Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt.
Our dataset meets or exceeds both WizardLM and Ultrachat on both knowledge-intensive leaderboard tasks as well as conversational evaluations.
arXiv Detail & Related papers (2024-06-14T17:44:08Z) - Phased Instruction Fine-Tuning for Large Language Models [12.037895935630882]
Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process.
It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets.
Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT.
arXiv Detail & Related papers (2024-06-01T04:25:26Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation [9.574486521686323]
Bonito is a model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning.
We show that Bonito significantly improves the average performance of pretrained and instruction tuned models over the de facto self supervised baseline.
arXiv Detail & Related papers (2024-02-28T13:54:57Z) - Exploring Format Consistency for Instruction Tuning [79.0698403613366]
In this work, we propose a framework named Unified Instruction Tuning (UIT)
UIT calls OpenAI APIs for automatic format transfer among different instruction tuning datasets such as PromptSource, FLAN and CrossFit.
With the framework, we demonstrate the necessity of maintaining format consistency in instruction tuning; (2) improve the generalization performance on unseen instructions on T5-LM-xl; and (3) provide a novel perplexity-based denoising method to reduce the noise of automatic format transfer.
arXiv Detail & Related papers (2023-07-28T12:00:13Z) - Self-Instruct: Aligning Language Models with Self-Generated Instructions [76.42871502364697]
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained language models.
Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model.
For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin.
arXiv Detail & Related papers (2022-12-20T18:59:19Z) - How Many Data Samples is an Additional Instruction Worth? [20.66688303609522]
Recently introduced instruction-paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language.
Our results indicate that an additional instruction can be equivalent to 200 data samples on average across tasks.
arXiv Detail & Related papers (2022-03-17T08:30:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.