Data Diversity Matters for Robust Instruction Tuning
- URL: http://arxiv.org/abs/2311.14736v2
- Date: Mon, 5 Feb 2024 16:41:10 GMT
- Title: Data Diversity Matters for Robust Instruction Tuning
- Authors: Alexander Bukharin and Tuo Zhao
- Abstract summary: Recent works have shown that by curating high quality and diverse instruction tuning datasets, we can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT) to control dataset diversity and quality.
We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can substantially improve worst and average case performance.
- Score: 93.87078483250782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have shown that by curating high quality and diverse instruction
tuning datasets, we can significantly improve instruction-following
capabilities. However, creating such datasets is difficult and most works rely
on manual curation or proprietary language models. Automatic data curation is
difficult as it is still not clear how we can define diversity for instruction
tuning, how diversity and quality depend on one another, and how we can optimize
dataset quality and diversity. To resolve these issues, we propose a new
algorithm, Quality-Diversity Instruction Tuning (QDIT). QDIT provides a simple
method to simultaneously control dataset diversity and quality, allowing us to
conduct an in-depth study on the effect of diversity and quality on instruction
tuning performance. From this study we draw two key insights: (1) there is a
natural tradeoff between data diversity and quality and (2) increasing data
diversity significantly improves worst-case instruction-following
performance, thereby improving robustness. We validate the performance of
QDIT on several large-scale instruction tuning datasets, where we find it can
substantially improve worst-case and average-case performance compared to
quality-driven data selection.
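
The abstract does not spell out how QDIT selects data, so the following is only an illustrative sketch of one way to trade off quality against diversity, not the authors' implementation. It assumes a greedy selection rule, a facility-location style coverage term as the diversity measure, cosine similarity between embeddings, and a mixing weight `alpha`; the function names `cosine` and `select_quality_diversity` and all parameters are hypothetical.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_quality_diversity(
    embeddings: Sequence[Sequence[float]],
    quality: Sequence[float],
    k: int,
    alpha: float,
) -> List[int]:
    """Greedily pick k indices, mixing diversity gain and quality.

    alpha = 1.0 ranks by quality alone; alpha = 0.0 ranks by diversity alone.
    This weighting scheme is an assumption made for illustration.
    """
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # coverage[i]: highest similarity between point i and anything selected so far
    coverage = [0.0] * n
    selected: List[int] = []
    remaining = set(range(n))

    for _ in range(min(k, n)):
        best, best_score = -1, float("-inf")
        for c in sorted(remaining):  # sorted for deterministic tie-breaking
            # Marginal facility-location gain: how much adding c improves coverage
            gain = sum(max(sim[i][c] - coverage[i], 0.0) for i in range(n)) / n
            score = (1.0 - alpha) * gain + alpha * quality[c]
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        remaining.remove(best)
        coverage = [max(coverage[i], sim[i][best]) for i in range(n)]
    return selected


# Toy data: items 0 and 1 are near-duplicates, item 2 is distinct but lower quality.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
toy_quality = [0.9, 0.8, 0.6]

quality_only = select_quality_diversity(toy_embeddings, toy_quality, k=2, alpha=1.0)
balanced = select_quality_diversity(toy_embeddings, toy_quality, k=2, alpha=0.5)
print(quality_only, balanced)
```

On this toy data, quality-only selection keeps both near-duplicates, while the balanced setting swaps one of them for the distinct item. That mirrors the tradeoff the abstract describes: gaining diversity costs some average quality.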
Related papers
- Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization [1.6958018695660049]
We show that generalization only emerges when training data is diversified enough across semantic domains.
We extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models.
arXiv Detail & Related papers (2024-10-07T03:15:11Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data that improves long-context capability.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation [21.506844286376275]
We propose a novel gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation.
Our key innovation centers around analyzing how individual training examples influence the model during training.
arXiv Detail & Related papers (2024-05-21T16:38:13Z) - Empowering Large Language Models for Textual Data Augmentation [23.483960932358396]
Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation.
This work proposes a new solution, which can automatically generate a large pool of augmentation instructions and select the most suitable task-informed instructions.
Empirically, the proposed approach consistently generates augmented data with better quality compared to non-LLM and LLM-based data augmentation methods.
arXiv Detail & Related papers (2024-04-26T18:04:25Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost.
Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z) - Rethinking the Instruction Quality: LIFT is What You Need [20.829372251475476]
Existing quality improvement methods alter instruction data through dataset expansion or curation.
We propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate the instruction quality to new heights.
Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs consistently uphold robust performance across various tasks.
arXiv Detail & Related papers (2023-12-12T03:30:21Z) - Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z) - D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data, referred to as a coreset, so as to maximize the performance of models trained on this subset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.