Related papers: IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

URL: http://arxiv.org/abs/2410.13464v1
Date: Thu, 17 Oct 2024 11:48:57 GMT
Title: IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
Authors: Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao
Abstract summary: We introduce $\textbf{IterSelectTune}$, an efficient, cost-effective iterative training policy for selecting high-quality instruction data. By fine-tuning on approximately 20% of the source data, our method consistently outperforms models fine-tuned on the full dataset.
Score: 28.581257601441045
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses. Although numerous instruction-tuning datasets have been developed to enhance LLM performance, selecting high-quality instruction data from large source datasets typically demands significant human effort. In this work, we introduce $\textbf{IterSelectTune}$, an efficient, cost-effective iterative training policy for selecting high-quality instruction data with no human involvement and limited reliance on GPT-4. By fine-tuning on approximately 20\% of the source data, our method consistently outperforms models fine-tuned on the full dataset across multiple benchmarks and public test datasets. These results highlight the effectiveness of our approach in enhancing LLM performance while reducing the computational resources required for instruction tuning.

Related papers

Importance-Aware Data Selection for Efficient LLM Instruction Tuning [12.894727887191621]
We propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing a model's capabilities. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset.
arXiv Detail & Related papers (2025-11-10T13:06:30Z)
T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning [5.963754140027611]
Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT) is a novel data selection framework. We demonstrate that models instruction-tuned on a curated dataset can outperform those trained on the entire large-scale dataset.
arXiv Detail & Related papers (2025-06-02T04:59:17Z)
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.
arXiv Detail & Related papers (2025-05-18T03:10:00Z)
RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection [29.459431336830267]
We propose a gradient-free method that quantifies the fine-grained contribution of individual samples to both task-level and global-level model performance. We introduce a lightweight selection paradigm trained on RICo scores, enabling scalable data selection with a strictly linear inference complexity.
arXiv Detail & Related papers (2025-05-08T15:17:37Z)
RAISE: Reinforenced Adaptive Instruction Selection For Large Language Models [48.63476198469349]
We propose a task-objective-driven instruction selection framework RAISE. RAISE incorporates the entire instruction fine-tuning process into optimization. It selects instruction at each step based on the expected impact of instruction on model performance improvement.
arXiv Detail & Related papers (2025-04-09T21:17:52Z)
Aligning Instruction Tuning with Pre-training [81.4748965653345]
We propose Aligning Instruction Tuning with Pre-training (AITP) to align instruction tuning with pre-training distributions. We show consistent performance improvements with AITP on three fully open large language models (LLMs) across eight benchmarks.
arXiv Detail & Related papers (2025-01-16T08:27:40Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search [25.108044778194536]
We introduce IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a scalable framework for efficiently synthesizing instructions. With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning. Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81.
arXiv Detail & Related papers (2024-10-14T11:28:30Z)
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking [56.93151679231602]
This research identifies two key stylistic elements in responses: linguistic form and semantic surprisal. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR). SCAR prioritizes instruction-response pairs in the training set based on the stylistic consistency of their responses.
arXiv Detail & Related papers (2024-06-16T10:10:37Z)
Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. Using only about 15% of the data, our approach can achieve average performance comparable to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning [39.73918872205541]
Many recent methods focus on improving the data quality but often overlook the compatibility of the data with the student model being finetuned. This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality. This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning.
arXiv Detail & Related papers (2024-02-15T17:06:21Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs). In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
Instruction Mining: Instruction Data Selection for Tuning Large Language Models [18.378654454336136]
InstructMining is designed for automatically selecting premium instruction-following data for finetuning large language models. We show that InstructMining achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
arXiv Detail & Related papers (2023-07-12T16:37:31Z)
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning [13.558918552284906]
This paper focuses on reducing the data used in instruction tuning for large language models (LLMs) to decrease training costs and improve data efficiency. The results suggest that task-specific models can be trained using less than 0.5% of the original dataset, with a 2% improvement in performance over those trained on full task-related data.
arXiv Detail & Related papers (2023-05-16T07:52:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.