Related papers: DELIFT: Data Efficient Language model Instruction Fine Tuning

URL: http://arxiv.org/abs/2411.04425v2
Date: Sun, 10 Nov 2024 05:24:33 GMT
Title: DELIFT: Data Efficient Language model Instruction Fine Tuning
Authors: Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky
Abstract summary: We introduce DELIFT, a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance.
Score: 13.538140114667772
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual fine-tuning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model's responses to other samples, effectively measuring the informational value relative to the model's current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.
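
The abstract describes selecting a subset by applying a submodular function to a pairwise utility metric. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes a precomputed matrix `utility[j][i]` (how much sample `j` helps the model respond to sample `i`) and uses facility location, one common submodular function, with standard greedy selection. The function names and the toy matrix are hypothetical.

```python
from typing import List, Sequence


def marginal_gain(utility: Sequence[Sequence[float]],
                  covered: Sequence[float],
                  candidate: int) -> float:
    """Facility-location gain: total improvement in per-sample best utility
    if `candidate` were added to the selected set."""
    return sum(max(utility[candidate][i] - covered[i], 0.0)
               for i in range(len(covered)))


def greedy_select(utility: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Greedily pick up to `budget` samples that maximize facility location."""
    n = len(utility)
    covered = [0.0] * n          # best utility each sample receives so far
    selected: List[int] = []
    for _ in range(min(budget, n)):
        best, best_gain = -1, -1.0
        for j in range(n):
            if j in selected:
                continue
            gain = marginal_gain(utility, covered, j)
            if gain > best_gain:  # ties go to the lowest index
                best, best_gain = j, gain
        selected.append(best)
        covered = [max(covered[i], utility[best][i]) for i in range(n)]
    return selected


# Toy utility matrix (hypothetical values): samples 0 and 1 are largely
# redundant with each other, as are samples 2 and 3.
toy_utility = [
    [1.0, 0.9, 0.1, 0.0],
    [0.7, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.6, 1.0],
]
subset = greedy_select(toy_utility, budget=2)
print(subset)
```

With a budget of two, the greedy procedure picks one sample from each redundant pair rather than two near-duplicates, which is the diversity effect the abstract attributes to submodular selection.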

Related papers

Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs [14.531280062127442]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation. We address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks.
arXiv Detail & Related papers (2025-07-29T03:51:00Z)
Data Efficacy for Language Model Training [29.901090317084005]
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. This work introduces a general paradigm, DELT, for considering data efficacy in LM training.
arXiv Detail & Related papers (2025-06-26T17:59:07Z)
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.
arXiv Detail & Related papers (2025-05-18T03:10:00Z)
DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning [22.704995231753397]
Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. We propose DONOD, a lightweight model-intrinsic data pruning method. By filtering out 70% of the full dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
arXiv Detail & Related papers (2025-04-21T02:25:03Z)
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [28.442470930703337]
PRISM is a training-free approach for efficient multimodal data selection. It uses Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs. It reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods.
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
REP: Resource-Efficient Prompting for On-device Continual Learning [23.92661395403251]
On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance. We introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods.
arXiv Detail & Related papers (2024-06-07T09:17:33Z)
Rethinking Overlooked Aspects in Vision-Language Models [32.525916879333145]
Recent advancements in large vision-language models (LVLMs) have been substantial. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets.
arXiv Detail & Related papers (2024-05-20T07:53:41Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique. Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets. Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace [21.015261553612643]
We present a dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7b to 33b parameters. Our study reveals three primary findings: (i) Despite the models' overall performance being tied to data and parameter scale, individual abilities have different sensitivities to these factors. Human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and can constantly enhance model performance with volume increases.
arXiv Detail & Related papers (2023-10-30T15:37:10Z)
Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data. The training process of Large Language Models (LLMs) generally incurs the update of significant parameters. This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching. Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.