Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
- URL: http://arxiv.org/abs/2305.14327v2
- Date: Thu, 26 Oct 2023 05:10:18 GMT
- Title: Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
- Authors: Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han,
Kai-Wei Chang
- Abstract summary: We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
- Score: 92.2167864437497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning has emerged to enhance the capabilities of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either manually annotate or employ LLM (e.g., GPT-series) to
generate data for instruction tuning. However, they often overlook associating
instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost for generating instructions (e.g., it
costs less than $12 USD by calling GPT-3.5-turbo for generating 800K
instruction tuning samples); 2) it provides high-quality data for instruction
tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform
with comparable data sizes); and 3) it supports the continuous improvement of
models by generating instruction-tuning data when a new annotated dataset
becomes available. We further investigate a continual learning scheme for
learning with the ever-growing instruction-tuning dataset, and demonstrate that
replaying tasks with diverse instruction embeddings not only helps mitigate
forgetting issues but generalizes to unseen tasks better.
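As a minimal sketch of how such diversity-based replay selection could be implemented (assuming per-task instruction embeddings are available as NumPy vectors), one option is greedy farthest-point sampling; the function name and heuristic below are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch (not the paper's exact procedure): greedily pick a
# replay subset whose instruction embeddings are spread far apart.
import numpy as np

def select_diverse_replay(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point sampling over task-level instruction embeddings."""
    n = embeddings.shape[0]
    k = min(k, n)
    chosen = [0]  # seed with an arbitrary task
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # task farthest from everything chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Example: choose 10 replay tasks out of 100 previously seen ones.
rng = np.random.default_rng(0)
replay_ids = select_diverse_replay(rng.normal(size=(100, 384)), k=10)
```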
Code and data are available at https://github.com/WadeYin9712/Dynosaur.
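To make the metadata-driven curation step described in the abstract concrete, here is a hypothetical sketch (not the released Dynosaur code): an LLM is asked which dataset fields act as input and output and what instruction connects them, and the existing annotations are then repackaged into (instruction, input, output) examples. The prompt wording, JSON keys, and the `generate` callable are placeholders.

```python
# Hypothetical sketch of metadata-driven instruction curation; the prompt,
# JSON schema, and `generate` callable are illustrative placeholders.
import json
from typing import Callable

PLAN_PROMPT = (
    "Dataset name: {name}\n"
    "Description: {description}\n"
    "Fields: {fields}\n"
    "Return JSON with keys 'input_fields', 'output_field', and 'instruction' "
    "describing a task that maps the input fields to the output field."
)

def curate(metadata: dict, records: list[dict],
           generate: Callable[[str], str]) -> list[dict]:
    """Turn an annotated dataset into (instruction, input, output) examples."""
    plan = json.loads(generate(PLAN_PROMPT.format(
        name=metadata["name"],
        description=metadata["description"],
        fields=", ".join(metadata["fields"]),
    )))
    examples = []
    for rec in records:
        examples.append({
            "instruction": plan["instruction"],
            "input": {f: rec[f] for f in plan["input_fields"]},
            "output": rec[plan["output_field"]],
        })
    return examples
```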
Related papers
- Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates [57.29125360837203]
Cookbook is a framework that generates training data consisting of simple patterns over random tokens.
We find that finetuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points.
arXiv Detail & Related papers (2024-10-07T17:29:40Z)
- REInstruct: Building Instruction Data from Unlabeled Corpus [49.82314244648043]
We propose REInstruct, a method to automatically build instruction data from an unlabeled corpus.
By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, the fine-tuned model achieves a 65.41% win rate on the AlpacaEval leaderboard against text-davinci-003.
arXiv Detail & Related papers (2024-08-20T09:05:03Z)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity [80.02202386597138]
We construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains.
Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at a fraction of the cost of manual construction.
arXiv Detail & Related papers (2024-07-22T17:55:22Z)
- GenQA: Generating Millions of Instructions from a Handful of Prompts [67.54980063851605]
Most public instruction finetuning datasets are relatively small compared to the closed-source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt.
Our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks as well as conversational evaluations.
arXiv Detail & Related papers (2024-06-14T17:44:08Z)
- Phased Instruction Fine-Tuning for Large Language Models [12.037895935630882]
Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process.
It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets.
Experiments with Llama-2 7B/13B/70B, Llama-3 8B/70B, and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT (a generic sketch of this kind of difficulty-phased schedule appears after this list).
arXiv Detail & Related papers (2024-06-01T04:25:26Z)
- Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation [9.574486521686323]
Bonito is a model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning.
We show that Bonito significantly improves the average performance of pretrained and instruction-tuned models over the de facto self-supervised baseline.
arXiv Detail & Related papers (2024-02-28T13:54:57Z)
- LongForm: Effective Instruction Tuning with Reverse Instructions [74.14035528786997]
We introduce the LongForm-C dataset, which is created by reverse instructions.
We generate instructions via LLMs for human-written corpus examples using reverse instructions.
Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering.
arXiv Detail & Related papers (2023-04-17T17:36:35Z)
- How Many Data Samples is an Additional Instruction Worth? [20.66688303609522]
The recently introduced instruction paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language.
Our results indicate that an additional instruction can be equivalent to 200 data samples on average across tasks.
arXiv Detail & Related papers (2022-03-17T08:30:30Z)
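The difficulty-phased schedule mentioned in the Phased IFT entry above can be pictured with the generic sketch below; `score_difficulty` and `finetune` stand in for the GPT-4 judging call and the actual training loop, and the fixed three-phase split is an illustrative assumption rather than the paper's exact recipe.

```python
# Generic sketch of difficulty-phased instruction tuning; the judge and
# training callables, and the number of phases, are assumptions.
from typing import Callable

def phased_ift(data: list[dict],
               score_difficulty: Callable[[dict], float],
               finetune: Callable[[list[dict]], None],
               n_phases: int = 3) -> None:
    """Sort examples by difficulty, bucket them, and train phase by phase."""
    ranked = sorted(data, key=score_difficulty)   # easy -> hard
    phase_size = -(-len(ranked) // n_phases)      # ceiling division
    for i in range(n_phases):
        phase = ranked[i * phase_size:(i + 1) * phase_size]
        if phase:
            finetune(phase)                       # uptrain on this phase only
```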