SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large
Language Models by Summarizing Training Trajectories of Small Models
- URL: http://arxiv.org/abs/2403.07384v1
- Date: Tue, 12 Mar 2024 07:45:33 GMT
- Title: SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large
Language Models by Summarizing Training Trajectories of Small Models
- Authors: Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
- Abstract summary: We introduce an effective and scalable data selection method for supervised fine-tuning.
We show that S2L significantly improves data efficiency in SFT for mathematical problem-solving.
We also show that S2L can perform data selection using a reference model 40x smaller than the target model.
- Score: 25.354520724493845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the effectiveness of data selection for large language models (LLMs)
during pretraining and instruction fine-tuning phases, improving data
efficiency in supervised fine-tuning (SFT) for specialized domains poses
significant challenges due to the complexity of fine-tuning data. To bridge
this gap, we introduce an effective and scalable data selection method for SFT,
SmallToLarge (S2L), which leverages training trajectories from small models to
guide the data selection for larger models. We demonstrate through extensive
experiments that S2L significantly improves data efficiency in SFT for
mathematical problem-solving, reducing the training data to just 11% of the
original MathInstruct dataset (Yue et al., 2023) to match full dataset
performance while outperforming state-of-the-art data selection algorithms by
an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,
selecting only 50K examples for SFT, S2L achieves 32.7% accuracy on the most
challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et
al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset
(Johnson et al., 2016), S2L again outperforms training on the full dataset
using only 50% of the data. Notably, S2L can perform data selection using a
reference model 40x smaller than the target model, proportionally reducing the
cost of data selection.
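As a rough illustration of the trajectory-based idea described in the abstract, the sketch below clusters per-example loss trajectories recorded from a small proxy model and then samples evenly across clusters to build the fine-tuning subset. The use of k-means, the cluster count, and the round-robin sampling are illustrative assumptions, not the authors' released implementation.

    # Hypothetical sketch of trajectory-based data selection in the spirit of S2L.
    # Assumed design choices (not from the paper's code): k-means over loss
    # trajectories and balanced round-robin sampling across clusters.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_by_trajectory(loss_trajectories, budget, n_clusters=20, seed=0):
        """Pick `budget` example indices by clustering loss trajectories.

        loss_trajectories: (n_examples, n_checkpoints) array of per-example losses
        recorded at several checkpoints while training a small proxy model.
        """
        budget = min(budget, len(loss_trajectories))
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(loss_trajectories)

        # Balanced (round-robin) sampling across clusters so the subset covers all
        # observed learning behaviors rather than only the easiest examples.
        pools = [rng.permutation(np.where(labels == c)[0]) for c in range(n_clusters)]
        selected, i = [], 0
        while len(selected) < budget:
            for pool in pools:
                if i < len(pool):
                    selected.append(int(pool[i]))
                    if len(selected) == budget:
                        break
            i += 1
        return np.array(selected)

    if __name__ == "__main__":
        # Toy example: 10,000 examples with losses logged at 8 proxy-model checkpoints.
        trajectories = np.random.rand(10_000, 8)
        subset = select_by_trajectory(trajectories, budget=1_000)
        print(subset.shape)  # (1000,)

In this toy run the subset keeps roughly equal representation from every trajectory cluster; in practice the trajectories would come from logging each example's loss at saved checkpoints of the small reference model, which is what lets a much smaller model guide selection for the large target model.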
Related papers
- 3DS: Decomposed Difficulty Data Selection's Case Study on LLM Medical Domain Adaptation [13.058299222554295]
Large Language Models excel in general tasks but struggle in specialized domains like healthcare.
We propose a two-stage model-centric data selection framework, Decomposed Difficulty Data Selection (3DS).
Our experiments on real-world healthcare datasets demonstrate that 3DS outperforms existing methods in accuracy by over 5.29%.
arXiv Detail & Related papers (2024-10-13T02:29:00Z) - Rethinking Data Selection at Scale: Random Selection is Almost All You Need [39.14807071480125]
Supervised fine-tuning is crucial for aligning Large Language Models with human instructions.
Most existing data selection techniques are designed for small-scale data pools.
arXiv Detail & Related papers (2024-10-12T02:48:34Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data with the reasoning skills needed for the intended downstream application (a rough sketch of this gradient-similarity style of selection appears after this list).
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - LoBaSS: Gauging Learnability in Supervised Fine-tuning Data [64.27898739929734]
Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large Language Models (LLMs) to specific task prerequisites.
We introduce a new dimension in SFT data selection: learnability.
We present the Loss Based SFT Data Selection (LoBaSS) method, utilizing data learnability as the principal criterion for the selection of SFT data.
arXiv Detail & Related papers (2023-10-16T07:26:24Z) - D4: Improving LLM Pretraining via Document De-Duplication and
Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z) - An Empirical Study of Large-Scale Data-Driven Full Waveform Inversion [33.19446101601603]
This paper investigates the impact of big data on deep learning models to help solve the full waveform inversion (FWI) problem.
We train and evaluate the FWI models on a combination of 10 2D subsets in OpenFWI that contain 470K pairs of seismic data and velocity maps in total.
Our experiments demonstrate that training on the combined dataset yields an average improvement of 13.03% in MAE, 7.19% in MSE and 1.87% in SSIM.
arXiv Detail & Related papers (2023-07-28T08:32:11Z)