Dynamics of Instruction Tuning: Each Ability of Large Language Models
Has Its Own Growth Pace
- URL: http://arxiv.org/abs/2310.19651v2
- Date: Thu, 22 Feb 2024 13:21:27 GMT
- Title: Dynamics of Instruction Tuning: Each Ability of Large Language Models
Has Its Own Growth Pace
- Authors: Chiyu Song, Zhanchao Zhou, Jianhao Yan, Yuejiao Fei, Zhenzhong Lan,
Yue Zhang
- Abstract summary: We present a dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7b to 33b parameters.
Our study reveals three primary findings: (i) Despite the models' overall performance being tied to data and parameter scale, individual abilities have different sensitivities to these factors.
Human-curated data strongly outperforms GPT-4-generated synthetic data in efficiency and can consistently enhance model performance as volume increases.
- Score: 21.015261553612643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning is a burgeoning method to elicit the general intelligence
of Large Language Models (LLMs). However, the creation of instruction data is
still largely heuristic, leading to significant variation in quantity and
quality across existing datasets. While some research advocates for expanding
the number of instructions, others suggest that a small set of well-chosen
examples is adequate. To better understand data construction guidelines, our
research provides a granular analysis of how data volume, parameter size, and
data construction methods influence the development of each underlying ability
of LLMs, such as creative writing, code generation, and logical reasoning. We
present a meticulously curated dataset with over 40k instances across ten
abilities and examine instruction-tuned models with 7b to 33b parameters. Our
study reveals three primary findings: (i) Despite the models' overall
performance being tied to data and parameter scale, individual abilities have
different sensitivities to these factors. (ii) Human-curated data strongly
outperforms synthetic data from GPT-4 in efficiency; it consistently enhances
model performance as volume increases, an effect not achievable with synthetic
data. (iii) Instruction data brings powerful cross-ability generalization, as
evidenced by out-of-domain evaluations. Furthermore, we demonstrate how these
findings can guide more efficient data constructions, leading to practical
performance improvements on two public benchmarks.
Related papers
- AgentInstruct: Toward Generative Teaching with Agentic Flows [12.192372792525726]
We focus on using synthetic data for post-training, specifically data created by powerful models to teach a new skill or behavior to another model.
We introduce AgentInstruct, an agentic framework for automatically creating large amounts of diverse and high-quality synthetic data.
We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, and reading comprehension.
arXiv Detail & Related papers (2024-07-03T21:01:12Z)
- Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning [64.5243480989869]
Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs).
This paper investigates how coding data impact LLMs' reasoning capacities during the IFT stage.
arXiv Detail & Related papers (2024-05-30T23:20:25Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a single specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the necessary reasoning skills for the intended downstream application.
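The core idea of gradient-similarity data selection can be sketched as follows. This is a simplified, hypothetical illustration, not the paper's actual implementation: LESS uses LoRA gradients and an optimizer-aware influence formulation, whereas here `select_influential`, the shared random projection, and all parameter choices are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_influential(train_grads, val_grad, k, dim=8):
    """Rank training examples by cosine similarity between their
    low-rank-projected gradients and a validation gradient; keep top-k."""
    d = train_grads.shape[1]
    # Shared random projection to a low-rank space keeps the search cheap
    # while approximately preserving inner products.
    R = rng.normal(size=(d, dim)) / np.sqrt(dim)
    pt = train_grads @ R                      # (n, dim) projected train grads
    pv = val_grad @ R                         # (dim,)   projected val grad
    sims = (pt @ pv) / (
        np.linalg.norm(pt, axis=1) * np.linalg.norm(pv) + 1e-12
    )
    return np.argsort(sims)[::-1][:k]         # indices of the top-k examples
```

A training example whose gradient points in the same direction as the validation gradient would reduce validation loss if trained on, which is why similarity serves as an influence proxy.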
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce DiverseEvol, a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
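Diversity-driven subset selection of this kind can be illustrated with a generic greedy k-center (farthest-point) heuristic. This is our own stand-in for intuition only; DiverseEvol's actual mechanism is iterative and model-driven, and `diverse_subset` and its parameters are assumptions, not the paper's algorithm.

```python
import numpy as np

def diverse_subset(embeddings, k, seed=0):
    """Greedy farthest-point (k-center) selection: start from a random
    example, then repeatedly add the example farthest from everything
    chosen so far, maximizing coverage/diversity of the subset."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    chosen = [int(rng.integers(n))]
    # Distance of every point to its nearest already-chosen point.
    dists = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))           # farthest from current subset
        chosen.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return chosen
```

Because each step picks the point farthest from the current subset, near-duplicate instructions are skipped in favor of examples that cover new regions of the embedding space.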
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, combining the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases [17.431381376675432]
In this paper, we explore the performance of instruction-tuned large language models across different scales of instruction data.
With Bloomz-7B1-mt as the base model, the results show that merely increasing the amount of instruction data leads to continuous improvement in tasks such as open-ended generation.
We propose potential future research directions such as effectively selecting high-quality training data, scaling base models and training methods specialized for hard tasks.
arXiv Detail & Related papers (2023-03-26T14:49:37Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.