Related papers: Compute-Constrained Data Selection

URL: http://arxiv.org/abs/2410.16208v3
Date: Mon, 02 Dec 2024 18:59:28 GMT
Title: Compute-Constrained Data Selection
Authors: Junjie Oscar Yin, Alexander M. Rush
Abstract summary: We find that many powerful data selection methods are almost never compute-optimal. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
Score: 77.06528009072967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the setting in which both the cost of selecting data and training are budgeted for. We first formalize the problem of data selection with a cost-aware utility function, and model the data selection problem as trading off initial-selection cost for training gain. We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate from both a theoretical and an empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
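
The trade-off described in the abstract can be illustrated with a rough compute-accounting sketch. This is not the paper's utility function: it assumes the common rule-of-thumb estimates of about 2 FLOPs per parameter per token for a forward pass and about 6 for a training step, and all model sizes, token counts, and function names below are hypothetical values chosen for illustration.

```python
def forward_flops(n_params: float, n_tokens: float) -> float:
    """Approximate cost of one forward pass (assumed ~2 FLOPs per parameter per token)."""
    return 2.0 * n_params * n_tokens


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate cost of training (assumed ~6 FLOPs per parameter per token)."""
    return 6.0 * n_params * n_tokens


def equivalent_random_tokens(total_budget: float, train_params: float) -> float:
    """Tokens one could train on if the whole budget went to training on unselected data."""
    return total_budget / (6.0 * train_params)


# Hypothetical setup: a 7B training model and a selection model 5x smaller.
train_params = 7e9
select_params = train_params / 5
pool_tokens = 1e9                  # candidate pool scored by the selection model
kept_tokens = 0.05 * pool_tokens   # 5% of the pool is kept for finetuning

# Perplexity-style selection is modeled here as one forward pass over the pool.
selection_cost = forward_flops(select_params, pool_tokens)
training_cost = training_flops(train_params, kept_tokens)
total_cost = selection_cost + training_cost

print(f"selection FLOPs: {selection_cost:.3e}")
print(f"training FLOPs:  {training_cost:.3e}")
print(f"total FLOPs:     {total_cost:.3e}")
print(f"random-data tokens at equal budget: "
      f"{equivalent_random_tokens(total_cost, train_params):.3e}")
```

Under these assumed numbers, scoring the pool costs more than the finetuning run itself, so the selected subset has to beat a random subset more than twice its size to justify the selection step.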

Related papers

Computational Budget Should Be Considered in Data Selection [21.598075666695483]
We argue that compute budget must be integral to data-selection strategies. We propose a novel Computational budget-Aware Data Selection (CADS) method. Our method achieves performance gains of up to 14.42% over baselines in vision and language benchmarks.
arXiv Detail & Related papers (2025-10-19T12:16:43Z)
RL-Guided Data Selection for Language Model Finetuning [3.477926761611361]
We propose a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies. Across four datasets, training on a 5% subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to 10.8 accuracy points.
arXiv Detail & Related papers (2025-09-30T06:42:19Z)
Data Selection for ERMs [67.57726352698933]
We study how well $\mathcal{A}$ can perform when trained on at most $n \ll N$ data points selected from a population of $N$ points. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression.
arXiv Detail & Related papers (2025-04-20T11:26:01Z)
TSDS: Data Selection for Task-Specific Model Finetuning [39.19448080265558]
The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset.
arXiv Detail & Related papers (2024-10-15T05:54:17Z)
A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
Avoid Wasted Annotation Costs in Open-set Active Learning with Pre-trained Vision-Language Model [3.647905567437244]
Active learning (AL) aims to enhance model performance by selectively collecting highly informative data. In practical scenarios, unlabeled data may contain out-of-distribution (OOD) samples, leading to wasted annotation costs. We propose a novel selection strategy, CLIPN for AL (CLIPNAL), which minimizes cost losses without requiring OOD samples.
arXiv Detail & Related papers (2024-08-09T07:54:57Z)
Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points.
arXiv Detail & Related papers (2024-06-25T16:52:37Z)
TextGram: Towards a better domain-adaptive pretraining [0.3769303106863454]
In NLP, pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. We propose our own domain-adaptive data selection method - TextGram. We show that the proposed strategy works better compared to other selection methods.
arXiv Detail & Related papers (2024-04-28T15:44:57Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together. We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
Towards Accelerated Model Training via Bayesian Data Selection [45.62338106716745]
Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models.
arXiv Detail & Related papers (2023-08-21T07:58:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.