The Best Instruction-Tuning Data are Those That Fit
- URL: http://arxiv.org/abs/2502.04194v2
- Date: Fri, 07 Feb 2025 02:20:28 GMT
- Title: The Best Instruction-Tuning Data are Those That Fit
- Authors: Dylan Zhang, Qirun Dai, Hao Peng
- Abstract summary: Supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs).
We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model.
For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model.
- Score: 17.401088816596054
- Abstract: High-quality supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs). Typically, instructions are paired with multiple responses sampled from other LLMs, which are often out of the distribution of the target model to be fine-tuned. This, at scale, can lead to diminishing returns and even hurt the models' performance and robustness. We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model. For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model, indicating that it aligns most closely with the target model's pretrained distribution; it then proceeds with standard SFT training. We first evaluate GRAPE with a controlled experiment, where we sample various solutions for each question in UltraInteract from multiple models and fine-tune commonly used LMs like LLaMA3.1-8B, Mistral-7B, and Qwen2.5-7B on GRAPE-selected data. GRAPE significantly outperforms strong baselines, including distilling from the strongest model with an absolute gain of up to 13.8%, averaged across benchmarks, and training on 3x more data with a maximum performance improvement of 17.3%. GRAPE's strong performance generalizes to realistic settings. We experiment with the post-training data used for Tulu3 and Olmo-2. GRAPE outperforms strong baselines trained on 4.5 times more data by 6.1% and a state-of-the-art data selection approach by 3% on average performance. Remarkably, using 1/3 of the data and half the number of epochs, GRAPE enables LLaMA3.1-8B to surpass the performance of Tulu3-SFT by 3.5%.
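As a concrete illustration of the selection rule described in the abstract, the sketch below scores each candidate response by its log-probability under the target model and keeps the highest-scoring one. It assumes a HuggingFace causal LM; the model name, helper functions, and length handling are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of GRAPE-style response selection: for each instruction, keep
# the candidate response to which the target model assigns the highest
# log-probability. All names below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # target model to be fine-tuned (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def response_logprob(instruction: str, response: str) -> float:
    """Summed log-probability of `response` given `instruction` under the target model."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]             # position t predicts token t+1
    targets = full_ids[:, 1:]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]                          # boundary tokenization is approximate
    return token_lp[:, n_prompt - 1:].sum().item()          # score only the response tokens

def grape_select(instruction: str, candidates: list[str]) -> str:
    """Pick the candidate that best fits the target model's pretrained distribution."""
    return max(candidates, key=lambda r: response_logprob(instruction, r))
```

The selected (instruction, response) pairs would then go through standard SFT, as the abstract describes.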
Related papers
- Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models [24.659722730219134]
This paper introduces Selective Self-to-Supervised Fine-Tuning (S3FT).
S3FT achieves better performance than the standard supervised fine-tuning (SFT) while improving generalization.
The effectiveness of S3FT is demonstrated through experiments on mathematical reasoning, Python programming and reading comprehension tasks.
arXiv Detail & Related papers (2025-02-12T05:24:21Z)
- Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data [19.221998577357713]
Large language models (LLMs) generally utilize a consistent data distribution throughout the pretraining process.
As the model's capability improves, it is intuitive that its data preferences dynamically change, indicating the need for pretraining with different data at various training stages.
We propose the Perplexity Difference (PD) based Preference Curriculum learning framework, which always perceives and uses the data preferred by LLMs to train and boost them.
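A hedged sketch of one way to realize this idea: score each example by the perplexity gap between a weaker and a stronger reference model, then schedule low-gap data early and high-gap data later. The scoring formula, stage count, and ordering below are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative perplexity-difference (PD) curriculum: examples a weak model
# struggles with (relative to a strong model) are deferred to later stages.
import math

def perplexity(mean_token_loss: float) -> float:
    """Perplexity from mean per-token cross-entropy loss."""
    return math.exp(mean_token_loss)

def pd_score(weak_loss: float, strong_loss: float) -> float:
    """Relative perplexity gap between the weak and strong reference models (assumed form)."""
    ppl_weak, ppl_strong = perplexity(weak_loss), perplexity(strong_loss)
    return (ppl_weak - ppl_strong) / ppl_weak

def build_curriculum(examples, weak_losses, strong_losses, n_stages=4):
    """Sort examples by PD and split into stages: low-PD data first, high-PD data later."""
    scored = sorted(zip(examples, map(pd_score, weak_losses, strong_losses)),
                    key=lambda pair: pair[1])
    stage_size = max(1, len(scored) // n_stages)
    return [[ex for ex, _ in scored[i:i + stage_size]]
            for i in range(0, len(scored), stage_size)]
```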
arXiv Detail & Related papers (2025-01-21T13:12:13Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Open-source large language models (LLMs) that can be run locally enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z)
- DEM: Distribution Edited Model for Training with Mixed Data Distributions [15.064693005258324]
We propose a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations.
The resulting model, namely Distribution Edited Model (DEM), is 11x cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks.
DEM does not require full re-training when modifying a single data-source, thus making it very flexible and scalable for training with diverse data sources.
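To make the "basic element-wise vector operations" concrete, the sketch below merges per-source fine-tuned checkpoints back into the base model through weighted parameter deltas. The coefficients and state-dict handling are illustrative assumptions, not the paper's released code.

```python
# Illustrative distribution editing: base + sum_i c_i * (source_i - base),
# applied element-wise to every parameter tensor.
import torch

def distribution_edited_model(base_state, source_states, coeffs):
    """Merge source-specific checkpoints into the base model via weighted deltas.

    base_state / source_states are state dicts; coeffs are per-source weights (assumed).
    """
    merged = {}
    for name, base_param in base_state.items():
        delta = torch.zeros_like(base_param, dtype=torch.float32)
        for src_state, c in zip(source_states, coeffs):
            delta += c * (src_state[name].float() - base_param.float())
        merged[name] = (base_param.float() + delta).to(base_param.dtype)
    return merged
```

Because each source contributes only a delta, swapping or re-weighting a single data source does not require retraining the others, which is the flexibility the summary points to.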
arXiv Detail & Related papers (2024-06-21T18:07:46Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
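For intuition, here is a rough sketch of Ask-LLM-style selection: a proxy LLM is asked whether each candidate example would be useful training data, and examples are kept in order of the judge's score. The prompt wording and the `judge` callable are assumptions for illustration; the cited paper defines the actual prompt and scoring.

```python
# Rough sketch of Ask-LLM-style data selection with a caller-supplied judge.
ASK_PROMPT = (
    "###\n{example}\n###\n"
    "Does the text above contain informative signal for pre-training a "
    "large language model? Answer yes or no."  # assumed wording, not the paper's prompt
)

def ask_llm_select(examples, judge, budget):
    """Keep the `budget` examples the judge scores highest.

    `judge(prompt) -> float` should return the probability the proxy model
    assigns to answering "yes"; its implementation is left to the caller.
    """
    scored = sorted(examples,
                    key=lambda ex: judge(ASK_PROMPT.format(example=ex)),
                    reverse=True)
    return scored[:budget]
```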
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
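The self-play mechanism can be summarized by a DPO-style objective in which the model's own generations play the opponent: the current model is trained to prefer the ground-truth response over its self-generated one, measured relative to the previous iteration's model. The function below is a hedged sketch of that loss; the variable names and beta value are illustrative.

```python
# Illustrative SPIN-style loss: logistic loss on the margin between the
# ground-truth response and the model's self-generated response.
import torch
import torch.nn.functional as F

def spin_loss(logp_gt, logp_self, ref_logp_gt, ref_logp_self, beta=0.1):
    """Each argument is a tensor of summed sequence log-probabilities under the
    current model (logp_*) or the previous-iteration model (ref_logp_*)."""
    margin = (logp_gt - ref_logp_gt) - (logp_self - ref_logp_self)
    return -F.logsigmoid(beta * margin).mean()
```

Iterating this loop lets the improving model serve as its own progressively stronger opponent, which is the point the summary makes.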
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9%.
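As a sketch of the rejection-sampling recipe behind that number: sample many candidate solutions per problem from one or more models, keep only those whose final answer matches the reference, deduplicate, and fine-tune on the survivors. The helper names and the answer-extraction rule below are assumptions for illustration.

```python
# Illustrative rejection-sampling fine-tuning (RFT) data construction.
def extract_final_answer(solution: str) -> str:
    """Toy extractor: assumes the final answer follows '#### ' (GSM8K-style)."""
    return solution.rsplit("####", 1)[-1].strip() if "####" in solution else ""

def build_rft_dataset(problems, samplers, n_samples=8):
    """problems: dicts with 'question' and 'answer'; samplers: callables that
    map a question to one sampled solution string (e.g., different models)."""
    dataset = []
    for prob in problems:
        kept = set()
        for sample in samplers:
            for _ in range(n_samples):
                solution = sample(prob["question"])
                if extract_final_answer(solution) == prob["answer"]:
                    kept.add(solution)          # keep only verified, distinct solutions
        dataset.extend({"question": prob["question"], "solution": s} for s in kept)
    return dataset
```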
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models that complement the uncertainty of a state-of-the-art model.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.