Data Selection for Language Models via Importance Resampling
- URL: http://arxiv.org/abs/2302.03169v3
- Date: Sat, 18 Nov 2023 21:33:01 GMT
- Title: Data Selection for Language Models via Importance Resampling
- Authors: Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang
- Abstract summary: We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
- Score: 90.9263039747723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting a suitable pretraining dataset is crucial for both general-domain
(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We
formalize this problem as selecting a subset of a large raw unlabeled dataset
to match a desired target distribution given unlabeled target samples. Due to
the scale and dimensionality of the raw text data, existing methods use simple
heuristics or require human experts to manually curate data. Instead, we extend
the classic importance resampling approach used in low dimensions to LM data
selection. We propose Data Selection with Importance Resampling (DSIR), an
efficient and scalable framework that estimates importance weights in a reduced
feature space for tractability and selects data with importance resampling
according to these weights. We instantiate the DSIR framework with hashed
n-gram features for efficiency, enabling the selection of 100M documents from
the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features
preserve the aspects of the data that are relevant to the target, we define KL
reduction, a data metric that measures the proximity between the selected
pretraining data and the target on some feature space. Across 8 data selection
methods (including expert selection), KL reduction on hashed n-gram features
highly correlates with average downstream accuracy (r=0.82). When selecting
data for continued pretraining on a specific domain, DSIR performs comparably
to expert curation across 8 target distributions. When pretraining
general-domain models (target is Wikipedia and books), DSIR improves over
random selection and heuristic filtering baselines by 2-2.5% on the GLUE
benchmark. Code is available at https://github.com/p-lambda/dsir.
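To make the pipeline described in the abstract concrete, here is a minimal sketch of DSIR-style selection with hashed n-gram features: documents are reduced to hashed n-gram count vectors, importance weights are estimated from bag-of-n-gram models fit on the target and raw data, and documents are resampled according to those weights (here via the Gumbel-top-k trick). A rough version of the KL-reduction metric is included as well. The bucket count, hashing scheme, and helper names are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch of DSIR-style selection with hashed n-gram features.
# Assumptions: a 10k-bucket hashed feature space, unigram+bigram features,
# and Gumbel-top-k resampling; the real implementation is at
# https://github.com/p-lambda/dsir.
import hashlib
import numpy as np

NUM_BUCKETS = 10_000  # size of the hashed feature space (assumption)

def hashed_ngram_counts(text, n=2):
    """Count unigrams and bigrams, hashed into NUM_BUCKETS buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = np.zeros(NUM_BUCKETS)
    for g in ngrams:
        counts[int(hashlib.md5(g.encode()).hexdigest(), 16) % NUM_BUCKETS] += 1
    return counts

def fit_log_probs(texts, smoothing=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    total = np.full(NUM_BUCKETS, smoothing)
    for t in texts:
        total += hashed_ngram_counts(t)
    return np.log(total / total.sum())

def dsir_select(raw_texts, target_texts, k, seed=0):
    """Select k raw documents by importance resampling in the feature space."""
    log_p_target = fit_log_probs(target_texts)
    log_p_raw = fit_log_probs(raw_texts)
    # Log importance weight of each raw document under the bag-of-n-gram models.
    log_w = np.array([hashed_ngram_counts(t) @ (log_p_target - log_p_raw)
                      for t in raw_texts])
    # Gumbel-top-k: sample k documents without replacement, proportional to weights.
    gumbel = np.random.default_rng(seed).gumbel(size=len(raw_texts))
    return np.argsort(-(log_w + gumbel))[:k]

def kl_reduction(target_texts, raw_texts, selected_texts):
    """Rough KL-reduction sketch: how much the selected data reduces the KL
    divergence to the target feature distribution relative to the raw data."""
    def dist(texts):
        c = np.full(NUM_BUCKETS, 1.0)
        for t in texts:
            c += hashed_ngram_counts(t)
        return c / c.sum()
    p_t, p_r, p_s = dist(target_texts), dist(raw_texts), dist(selected_texts)
    kl = lambda p, q: float(np.sum(p * (np.log(p) - np.log(q))))
    return kl(p_t, p_r) - kl(p_t, p_s)
```

As a usage illustration (with hypothetical variable names), `dsir_select(pile_docs, wikipedia_docs, k=10_000)` would return indices of 10k raw documents whose hashed n-gram profile most resembles the target, and `kl_reduction` could then compare that subset against, say, a random selection of the same size.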
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z) - CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training [10.511388205893295]
We propose a data selection method, CoLoR-Filter, which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion.
CoLoR-Filter can select data to train a 1.2b parameter target model that matches a 1.2b parameter model trained on 25b randomly selected tokens, using 25x less data for Books and 11x less data for the downstream tasks.
arXiv Detail & Related papers (2024-06-15T15:28:02Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z) - TextGram: Towards a better domain-adaptive pretraining [0.3769303106863454]
In NLP, pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks.
We propose our own domain-adaptive data selection method - TextGram.
We show that the proposed strategy works better compared to other selection methods.
arXiv Detail & Related papers (2024-04-28T15:44:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Project and Probe: Sample-Efficient Domain Adaptation by Interpolating
Orthogonal Features [119.22672589020394]
We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features.
Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data.
arXiv Detail & Related papers (2023-02-10T18:58:03Z) - Automatic Document Selection for Efficient Encoder Pretraining [31.941315346316465]
We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets.
We treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile.
Both on perplexity and on several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost.
arXiv Detail & Related papers (2022-10-20T01:45:02Z)