Importance-Aware Data Selection for Efficient LLM Instruction Tuning
- URL: http://arxiv.org/abs/2511.07074v1
- Date: Mon, 10 Nov 2025 13:06:30 GMT
- Title: Importance-Aware Data Selection for Efficient LLM Instruction Tuning
- Authors: Tingyu Jiang, Shen Li, Yiyao Song, Lan Zhang, Hualei Zhu, Yuan Zhao, Xiaohang Xu, Kenjiro Taura, Hao Henry Wang
- Abstract summary: We propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing a model's capabilities. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset.
- Score: 12.894727887191621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning plays a critical role in enhancing the performance and efficiency of Large Language Models (LLMs). Its success depends not only on the quality of the instruction data but also on the inherent capabilities of the LLM itself. Some studies suggest that even a small amount of high-quality data can achieve instruction fine-tuning results that are on par with, or even exceed, those from using a full-scale dataset. However, rather than focusing solely on calculating data quality scores to evaluate instruction data, there is a growing need to select high-quality data that maximally enhances the performance of instruction tuning for a given LLM. In this paper, we propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing a model's capabilities. The MIWV metric is derived from the discrepancies in the model's responses when using In-Context Learning (ICL), helping identify the most beneficial data for enhancing instruction tuning performance. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset. Furthermore, this approach extends beyond existing research that focuses on data quality scoring for data selection, offering strong empirical evidence supporting the effectiveness of our proposed method.
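The selection step described in the abstract (score every sample, keep the top 1%) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the MIWV formula, so the score below is a hypothetical stand-in that assumes the "discrepancy" is the response loss with an in-context demonstration minus the response loss without one. The names `Sample`, `discrepancy_score`, and `select_top_fraction` are invented for this sketch, and the per-sample losses are assumed to have been computed beforehand with the target LLM.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str
    loss_zero_shot: float  # assumed precomputed: response loss given the instruction alone
    loss_with_icl: float   # assumed precomputed: response loss with an ICL demonstration prepended


def discrepancy_score(sample: Sample) -> float:
    """Hypothetical stand-in for MIWV: ICL loss minus zero-shot loss.

    The sign convention is an assumption made for illustration only.
    """
    return sample.loss_with_icl - sample.loss_zero_shot


def select_top_fraction(samples, fraction=0.01):
    """Return the highest-scoring `fraction` of samples (always at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []
    k = max(1, math.ceil(len(samples) * fraction))
    # Sort by score, largest first, and keep the first k entries.
    return sorted(samples, key=discrepancy_score, reverse=True)[:k]
```

Calling `select_top_fraction(pool, fraction=0.01)` on a scored pool would return the subset to fine-tune on. Swapping in a different scoring function only requires changing the `key` used for sorting.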
Related papers
- Data Efficacy for Language Model Training [29.901090317084005]
Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. This work introduces a general paradigm, DELT, for considering data efficacy in LM training.
arXiv Detail & Related papers (2025-06-26T17:59:07Z)
- Augmented Relevance Datasets with Fine-Tuned Small LLMs [0.7022492404644501]
This paper explores the use of small, fine-tuned large language models (LLMs) to automate relevance assessment. We fine-tuned small LLMs to enhance relevance assessments, thereby improving dataset creation quality for downstream ranking model training.
arXiv Detail & Related papers (2025-04-14T02:35:00Z)
- MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning. We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
arXiv Detail & Related papers (2025-03-26T12:42:37Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection [28.581257601441045]
We introduce IterSelectTune, an efficient, cost-effective iterative training policy for selecting high-quality instruction data.
By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset.
arXiv Detail & Related papers (2024-10-17T11:48:57Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.