TSDS: Data Selection for Task-Specific Model Finetuning
- URL: http://arxiv.org/abs/2410.11303v2
- Date: Wed, 23 Oct 2024 03:00:41 GMT
- Title: TSDS: Data Selection for Task-Specific Model Finetuning
- Authors: Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
- Abstract summary: The efficacy of task-specific finetuning largely depends on the selection of appropriate training data.
We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning.
We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset.
- Score: 39.19448080265558
- Abstract: Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.
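The abstract names the ingredients (an optimal-transport alignment loss, a KDE-based diversity regularizer, and a reduction to nearest neighbor search) but not the exact procedure. Below is a minimal, hypothetical Python sketch of that recipe under stated assumptions: exact k-NN distance to the query examples stands in for the optimal-transport alignment loss, a kernel-density penalty down-weights near-duplicates, and a greedy top-`budget` rule replaces the paper's optimization. All names and constants (`select_data`, `alpha`, `bandwidth`) are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KernelDensity

def select_data(candidates, queries, k_per_query=50, budget=100,
                alpha=1.0, bandwidth=0.5):
    """Toy TSDS-style selector (not the paper's exact algorithm).

    Scores each candidate by its distance to the target examples
    (a stand-in for the optimal-transport alignment loss) plus a
    KDE penalty that down-weights near-duplicate-heavy regions.
    """
    # Alignment term: distance from each target query to its k nearest candidates.
    nn = NearestNeighbors(n_neighbors=k_per_query).fit(candidates)
    dist, idx = nn.kneighbors(queries)           # each of shape (n_queries, k)

    # Diversity term: estimated density of each candidate among all candidates;
    # near-duplicates sit in high-density regions and get penalized.
    kde = KernelDensity(bandwidth=bandwidth).fit(candidates)
    log_density = kde.score_samples(candidates)  # shape (n_candidates,)

    # Combine: lower distance is better, higher density is worse.
    scores = np.full(len(candidates), np.inf)
    for d_row, i_row in zip(dist, idx):
        cost = d_row + alpha * np.exp(log_density[i_row])
        scores[i_row] = np.minimum(scores[i_row], cost)

    return np.argsort(scores)[:budget]           # indices of selected candidates

# Usage: with 5,000 candidates and budget=50, this mimics the 1%
# selection ratio highlighted in the paper's instruction-tuning results.
rng = np.random.default_rng(0)
chosen = select_data(rng.normal(size=(5000, 32)),
                     rng.normal(size=(20, 32)), budget=50)
```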
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach [17.79010397902909]
We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks.
This problem has broad applications in NLP, such as targeted instruction tuning and data selection in chain-of-thought fine-tuning.
We introduce a new algorithm to estimate model fine-tuning performance without repeated training.
arXiv Detail & Related papers (2024-09-28T21:26:50Z)
- Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [38.39395973523944]
We propose a three-stage scheme for data selection and review existing works according to this scheme.
We find that more targeted methods with data-specific and model-specific quality labels achieve higher efficiency.
arXiv Detail & Related papers (2024-06-20T08:58:58Z)
- BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges [12.248397169100784]
Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training.
We introduce Best Window Selection (BWS), a universal and efficient data subset selection method that chooses the best window subset from samples ordered by their difficulty scores (a toy sketch follows this list).
arXiv Detail & Related papers (2024-06-05T08:33:09Z)
- Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods are slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over a graph of the dataset for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours (see the sketch after this list).
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
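Two of the methods above are concrete enough for toy sketches. First, the window idea behind BWS (referenced in its entry): order samples by a difficulty score, slide a fixed-size window over the ordering, and keep the window that scores best under a cheap proxy. The 1-NN validation proxy used here is an assumption for illustration; the paper evaluates windows with its own proxy task.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def best_window(X, y, difficulty, X_val, y_val, window=500, stride=250):
    """Toy BWS-style selection: pick the best contiguous window of
    difficulty-sorted samples, judged by a cheap 1-NN validation proxy
    (an illustrative stand-in for the paper's window evaluation)."""
    order = np.argsort(difficulty)                # easy -> hard
    best_win, best_acc = None, -1.0
    for start in range(0, len(order) - window + 1, stride):
        win = order[start:start + window]
        acc = KNeighborsClassifier(1).fit(X[win], y[win]).score(X_val, y_val)
        if acc > best_acc:
            best_win, best_acc = win, acc
    return best_win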
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.