GenQA: Generating Millions of Instructions from a Handful of Prompts
- URL: http://arxiv.org/abs/2406.10323v1
- Date: Fri, 14 Jun 2024 17:44:08 GMT
- Title: GenQA: Generating Millions of Instructions from a Handful of Prompts
- Authors: Jiuhai Chen, Rifaa Qadri, Yuxin Wen, Neel Jain, John Kirchenbauer, Tianyi Zhou, Tom Goldstein
- Abstract summary: Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt.
Our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks as well as conversational evaluations.
- Score: 67.54980063851605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models. To study questions about finetuning at scale, such as curricula and learning rate cooldown schedules, there is a need for industrial-scale datasets. However, this scale necessitates a data generation process that is almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples ranging from simple completion tasks to complex multi-turn dialogs across a variety of subject areas. When finetuning a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on both knowledge-intensive leaderboard tasks as well as conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.
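A single "generator" prompt has to supply its own diversity; one simple way to do that is to ask the model to pick a random topic before writing the instruction and its response. The loop below is a minimal sketch of that idea under stated assumptions: the prompt wording, the JSON output format, and the OpenAI-style client are illustrative stand-ins, not the paper's released generator prompts.

```python
# Minimal sketch of a generator-prompt loop in the spirit of GenQA (illustrative only).
# Assumes an OpenAI-compatible chat API; the prompt text and JSON parsing are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

# A single static prompt that asks the model to choose its own random topic,
# so diversity comes from sampling rather than from many hand-written prompts.
GENERATOR_PROMPT = (
    "Pick one random academic subject, then a random subtopic within it. "
    "Write one challenging instruction about that subtopic and a detailed response. "
    'Return JSON: {"instruction": "...", "response": "..."}'
)


def generate_examples(n: int, model: str = "gpt-4o-mini") -> list[dict]:
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": GENERATOR_PROMPT}],
            temperature=1.0,  # high temperature encourages topic diversity
        )
        content = resp.choices[0].message.content or ""
        try:
            examples.append(json.loads(content))
        except json.JSONDecodeError:
            continue  # drop malformed generations; the process stays automated
    return examples


if __name__ == "__main__":
    for ex in generate_examples(3):
        print(ex.get("instruction", "")[:80])
```

Run at high temperature and repeated many times, a prompt like this can in principle scale to millions of examples with little human oversight, as described in the abstract; a production run would also need deduplication and filtering.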
Related papers
- DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback [62.235925602004535]
We introduce DataEnvGym, a testbed of teacher environments for data generation agents.
DataEnvGym frames data generation as a sequential decision-making task.
The agent's goal is to improve the student's performance.
We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers.
arXiv Detail & Related papers (2024-10-08T17:20:37Z)
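DataEnvGym's framing above, data generation as a sequential decision-making loop in which a teacher agent produces training data and a student model's feedback guides the next step, can be pictured with the minimal skeleton below. Every class and method name here is a hypothetical illustration, not part of the DataEnvGym API.

```python
# Hypothetical teacher/student data-generation loop in the spirit of the
# DataEnvGym framing; class and method names are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class Feedback:
    accuracy: float                                  # student score on a held-out eval
    errors: list[str] = field(default_factory=list)  # items the student answered wrong


class Teacher:
    """The data generation agent: decides what training data to produce next."""

    def propose_batch(self, feedback: Feedback) -> list[dict]:
        # A real agent would call an LLM here, targeting the student's weak spots.
        return [{"question": q, "answer": "..."} for q in feedback.errors]


class Student:
    """The model being improved; training and evaluation are stubbed out."""

    def train(self, batch: list[dict]) -> None:
        pass  # fine-tune on the generated batch

    def evaluate(self) -> Feedback:
        return Feedback(accuracy=0.5, errors=["What is 2 + 2 * 3?"])


def run(steps: int = 5) -> None:
    teacher, student = Teacher(), Student()
    feedback = student.evaluate()
    for _ in range(steps):                 # the sequential decision-making loop
        batch = teacher.propose_batch(feedback)
        student.train(batch)
        feedback = student.evaluate()      # improvement in accuracy is the agent's goal
```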
- CoDi: Conversational Distillation for Grounded Question Answering [10.265241619616676]
We introduce a novel data distillation framework named CoDi.
CoDi allows us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner.
We show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics.
arXiv Detail & Related papers (2024-08-20T22:35:47Z)
- An Automatic Prompt Generation System for Tabular Data Tasks [3.117741687220381]
Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts.
This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training.
arXiv Detail & Related papers (2024-05-09T08:32:55Z)
- Prompt2Model: Generating Deployable Models from Natural Language Instructions [74.19816829003729]
Large language models (LLMs) enable system builders to create competent NLP systems through prompting.
In other ways, LLMs are a step backward from traditional special-purpose NLP models.
We propose Prompt2Model, a general-purpose method that takes a natural language task description like the prompts provided to LLMs, and uses it to train a special-purpose model that is conducive to deployment.
arXiv Detail & Related papers (2023-08-23T17:28:21Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation [92.2167864437497]
We propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data.
Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions; 2) it provides high-quality data for instruction tuning; and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available.
arXiv Detail & Related papers (2023-05-23T17:56:26Z)
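Dynosaur, as summarized above, starts from the metadata of an existing annotated dataset and asks an LLM to turn its fields into instruction-tuning tasks. The snippet below is a hypothetical rendering of that step; the metadata example, prompt, and OpenAI-style client are assumptions rather than Dynosaur's released code.

```python
# Hypothetical sketch of turning dataset metadata into instruction-tuning data,
# in the spirit of Dynosaur; the prompt and parsing are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

METADATA = {
    "dataset": "ag_news",
    "description": "News articles labeled with one of four topics.",
    "fields": ["text", "label"],
}

PROMPT_TEMPLATE = (
    "Given a dataset named {dataset} ({description}) with fields {fields}, "
    "write a task instruction, and say which field is the input and which is the output. "
    'Return JSON: {{"instruction": "...", "input_field": "...", "output_field": "..."}}'
)


def instructions_from_metadata(metadata: dict, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**metadata)}],
    )
    # The generated instruction plus the dataset's existing annotations yield
    # finished training examples without generating the answers themselves.
    return json.loads(resp.choices[0].message.content or "{}")
```

Because the answers already exist in the annotated dataset, only the instruction and field mapping need to be generated, which is where the API-cost saving listed above comes from.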
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
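The explain-then-annotate recipe in the AnnoLLM entry above can be read as two chained prompts: first ask the LLM why a demonstration example carries its gold label, then reuse that explanation as reasoning context when annotating new examples. The prompts and model name below are illustrative assumptions, not the paper's exact templates.

```python
# Hypothetical two-step explain-then-annotate sketch in the spirit of AnnoLLM;
# the prompt wording is an assumption, not the paper's templates.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # stand-in for the GPT-3.5 series models named in the entry


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content or ""


def explain(demo_text: str, gold_label: str) -> str:
    # Step 1: have the LLM explain why the gold label fits the demonstration.
    return ask(
        f"Text: {demo_text}\nLabel: {gold_label}\n"
        "Explain briefly why this label is correct."
    )


def annotate(new_text: str, demo_text: str, gold_label: str, explanation: str) -> str:
    # Step 2: reuse the explanation as reasoning context when labeling new data.
    return ask(
        f"Example text: {demo_text}\nLabel: {gold_label}\nReason: {explanation}\n\n"
        f"Now label this text the same way: {new_text}\nLabel:"
    )
```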
- Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation [32.83012699501051]
We improve generative data augmentation by formulating data generation as a context generation task.
We cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain.
We demonstrate substantial performance improvements in few-shot and zero-shot settings.
arXiv Detail & Related papers (2022-05-25T09:28:21Z)
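The last entry's recipe, casting a downstream task into question-answering form and letting a context generator that received intermediate training on QA datasets produce supporting passages, can be sketched as follows. The checkpoint, prompt format, and sentiment example are placeholders assumed for illustration, not the authors' setup.

```python
# Schematic of QA-format data augmentation via context generation; the model
# checkpoint and prompt format are placeholders, not the authors' code.
from transformers import pipeline

# Stand-in for a generator that received intermediate training on QA datasets.
context_generator = pipeline("text-generation", model="gpt2")


def cast_to_qa(example: dict) -> tuple[str, str]:
    """Cast a downstream task example (here, sentiment) into QA form."""
    question = f"What is the sentiment of the review about {example['topic']}?"
    answer = example["label"]  # e.g. "positive" or "negative"
    return question, answer


def augment(example: dict, n: int = 3) -> list[dict]:
    question, answer = cast_to_qa(example)
    prompt = f"Question: {question}\nAnswer: {answer}\nContext:"
    outputs = context_generator(
        prompt, max_new_tokens=60, do_sample=True, num_return_sequences=n
    )
    # Each generated context becomes a new synthetic training example.
    return [
        {
            "context": out["generated_text"][len(prompt):].strip(),
            "question": question,
            "answer": answer,
        }
        for out in outputs
    ]
```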
This list is automatically generated from the titles and abstracts of the papers on this site.