DsDm: Model-Aware Dataset Selection with Datamodels
- URL: http://arxiv.org/abs/2401.12926v1
- Date: Tue, 23 Jan 2024 17:22:00 GMT
- Title: DsDm: Model-Aware Dataset Selection with Datamodels
- Authors: Logan Engstrom, Axel Feldmann, Aleksander Madry
- Abstract summary: Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
- Score: 81.01744199870043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When selecting data for training large-scale models, standard practice is to
filter for examples that match human notions of data quality. Such filtering
yields qualitatively clean datapoints that intuitively should improve model
behavior. However, in practice the opposite can often happen: we find that
selecting according to similarity with "high quality" data sources may not
increase (and can even hurt) performance compared to randomly selecting data.
To develop better methods for selecting data, we start by framing dataset
selection as an optimization problem that we can directly solve for: given
target tasks, a learning algorithm, and candidate data, select the subset that
maximizes model performance. This framework thus avoids handpicked notions of
data quality, and instead models explicitly how the learning process uses train
datapoints to predict on the target tasks. Our resulting method greatly
improves language model (LM) performance on both pre-specified tasks and
previously unseen tasks. Specifically, choosing target tasks representative of
standard LM problems and evaluating on diverse held-out benchmarks, our
selected datasets provide a 2x compute multiplier over baseline methods.
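The optimization framing in the abstract can be made concrete with a short sketch. The code below is a minimal illustration, not the paper's implementation: it assumes per-example datamodel weights (each approximating how much including a candidate example changes loss on the target tasks under the learning algorithm) have already been estimated elsewhere, and then simply picks the size-k subset with the best predicted target performance. The names `select_subset`, `datamodel_weights`, and `k` are placeholders.

```python
import numpy as np

def select_subset(datamodel_weights: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k candidates predicted to most reduce target-task loss.

    Under a linear datamodel, predicted target loss is (approximately) additive in
    the chosen examples' weights, so the predicted-optimal size-k subset is just
    the k examples with the most negative weights.
    """
    return np.argsort(datamodel_weights)[:k]

# Hypothetical usage: one estimated weight per candidate training example,
# e.g. fit by regressing target-task loss on random training-subset indicators.
rng = np.random.default_rng(0)
weights = rng.normal(size=10_000)          # stand-in for estimated datamodel weights
subset = select_subset(weights, k=1_000)   # indices of the selected training data
```

The linear approximation is what makes the otherwise combinatorial subset-selection problem tractable here: the predicted-best subset can be read off by sorting the estimated weights.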
Related papers
- TSDS: Data Selection for Task-Specific Model Finetuning [39.19448080265558]
The efficacy of task-specific finetuning largely depends on the selection of appropriate training data.
We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning.
We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset.
arXiv Detail & Related papers (2024-10-15T05:54:17Z) - Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [38.39395973523944]
We propose a three-stage scheme for data selection and review existing works according to this scheme.
We find that more targeted methods with data-specific and model-specific quality labels achieve higher efficiency.
arXiv Detail & Related papers (2024-06-20T08:58:58Z) - TextGram: Towards a better domain-adaptive pretraining [0.3769303106863454]
In NLP, pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks.
We propose our own domain-adaptive data selection method - TextGram.
We show that the proposed strategy outperforms other selection methods.
arXiv Detail & Related papers (2024-04-28T15:44:57Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application (a simplified gradient-similarity sketch appears after this list).
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that represents the dataset as a graph and uses forward and reverse message passing over this graph for coreset selection (see the simplified sketch after this list).
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
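The D2 Pruning entry above describes forward and reverse message passing over a dataset graph. Below is a heavily simplified, hypothetical sketch of that idea, not the paper's algorithm: it assumes precomputed example embeddings and nonnegative per-example difficulty scores, builds a k-nearest-neighbor graph, blends each example's difficulty with its neighbors' (the "forward" step), and then greedily selects high-scoring examples while down-weighting their neighbors to keep the selection diverse (the "reverse" step). All function and parameter names are illustrative.

```python
import numpy as np

def knn_graph(emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of each example's k nearest neighbors by cosine similarity."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)          # no self-edges
    return np.argsort(-sims, axis=1)[:, :k]

def select_coreset(emb: np.ndarray, difficulty: np.ndarray,
                   budget: int, k: int = 10, gamma: float = 0.5) -> np.ndarray:
    """Greedy diversity/difficulty selection in the spirit of D2 Pruning (simplified)."""
    nbrs = knn_graph(emb, k)
    # "Forward" step: each example's score mixes its own difficulty with its neighbors'.
    scores = difficulty + gamma * difficulty[nbrs].mean(axis=1)
    selected = []
    for _ in range(budget):
        i = int(np.argmax(scores))
        selected.append(i)
        # "Reverse" step: shrink (nonnegative) neighbor scores so near-duplicates
        # of the pick become less attractive, then retire the pick itself.
        scores[nbrs[i]] *= (1.0 - gamma)
        scores[i] = -np.inf
    return np.array(selected)
```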
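Similarly, the LESS entry earlier in the list scores instruction-tuning candidates by gradient similarity to a target task. The sketch below is a simplified stand-in, not the paper's implementation: it assumes low-dimensional gradient features for candidates and target-task examples have already been computed elsewhere (e.g., projected adapter gradients), ranks candidates by their best cosine match to any target example, and keeps a top fraction. The function name and the `frac` default are illustrative.

```python
import numpy as np

def gradient_similarity_select(cand_grads: np.ndarray,
                               target_grads: np.ndarray,
                               frac: float = 0.05) -> np.ndarray:
    """Keep the top `frac` of candidates by cosine similarity of their
    gradient features to the target task's gradient features."""
    c = cand_grads / np.linalg.norm(cand_grads, axis=1, keepdims=True)
    t = target_grads / np.linalg.norm(target_grads, axis=1, keepdims=True)
    influence = (c @ t.T).max(axis=1)            # best match over target examples
    k = max(1, int(frac * len(cand_grads)))
    return np.argsort(-influence)[:k]
```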