Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models
- URL: http://arxiv.org/abs/2406.14115v2
- Date: Mon, 24 Feb 2025 07:59:00 GMT
- Title: Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models
- Authors: Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li
- Abstract summary: Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets. Existing surveys overlook an in-depth exploration of the fine-tuning phase. We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods.
- Score: 36.22392593103493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research lacks a clear, unified framework, and the variability in experimental settings complicates systematic comparisons. While existing surveys comprehensively overview the stages and methods of data selection, they often overlook an in-depth exploration of the fine-tuning phase. In this paper, we conduct a focused review of recent data selection techniques for fine-tuning LLMs, analyzing a dozen key studies. We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods. Additionally, we propose a unified comparison approach that incorporates ratio-based efficiency and ranking-based feasibility metrics to address inconsistencies across experiments. Our findings reveal that methods emphasizing more targeted quality measurement achieve higher efficiency but at the cost of feasibility. Finally, we discuss trends and highlight four key challenges in fine-tuning data selection, offering potential directions for future research.
Related papers
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [41.4789135538612]
This paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples.
Thanks to the advanced language understanding capabilities of Large Language Models (LLMs), we utilize LLMs to evaluate the value of each option during the selection process.
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
- TSDS: Data Selection for Task-Specific Model Finetuning [39.19448080265558]
The efficacy of task-specific finetuning largely depends on the selection of appropriate training data.
We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning.
We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset.
arXiv Detail & Related papers (2024-10-15T05:54:17Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement [8.509688686402438]
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities.
This work addresses the question: How can we determine the optimal subset of data for effective training?
Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset.
arXiv Detail & Related papers (2024-09-17T17:25:31Z)
- An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to model potentially non-monotonic preferences.
We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration.
Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z)
- Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models [33.488331159912136]
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference.
Data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning.
We present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs.
arXiv Detail & Related papers (2024-08-04T16:50:07Z)
- A Two-Phase Recall-and-Select Framework for Fast Model Selection [13.385915962994806]
We propose a two-phase (coarse-recall and fine-selection) model selection framework.
It aims to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets.
It has been demonstrated that the proposed methodology facilitates the selection of a high-performing model about 3x faster than conventional baseline methods.
arXiv Detail & Related papers (2024-03-28T14:44:44Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection [0.1474723404975345]
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task.
This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions.
arXiv Detail & Related papers (2023-11-30T17:44:22Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation [24.65301562548798]
We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation.
We conduct an empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work.
arXiv Detail & Related papers (2022-11-03T16:26:06Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Model-specific Data Subsampling with Influence Functions [37.64859614131316]
We develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence.
Specifically, we leverage influence functions to guide our selection strategy, proving theoretically and demonstrating empirically that our approach quickly selects high-quality models.
arXiv Detail & Related papers (2020-10-20T12:10:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.