When Less is More: Investigating Data Pruning for Pretraining LLMs at
Scale
- URL: http://arxiv.org/abs/2309.04564v1
- Date: Fri, 8 Sep 2023 19:34:05 GMT
- Title: When Less is More: Investigating Data Pruning for Pretraining LLMs at
Scale
- Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
- Abstract summary: Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
- Score: 12.94829977468838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large volumes of text data have contributed significantly to the development
of large language models (LLMs) in recent years. This data is typically
acquired by scraping the internet, leading to pretraining datasets comprised of
noisy web text. To date, efforts to prune these datasets down to a higher
quality subset have relied on hand-crafted heuristics encoded as rule-based
filters. In this work, we take a wider view and explore scalable estimates of
data quality that can be used to systematically measure the quality of
pretraining data. We perform a rigorous comparison at scale of the simple data
quality estimator of perplexity, as well as more sophisticated and
computationally intensive estimates of the Error L2-Norm and memorization.
These metrics are used to rank and prune pretraining corpora, and we
subsequently compare LLMs trained on these pruned datasets. Surprisingly, we
find that the simple technique of perplexity outperforms our more
computationally expensive scoring methods. We improve over our no-pruning
baseline while training on as little as 30% of the original training dataset.
Our work sets the foundation for unexplored strategies in automatically
curating high quality corpora and suggests the majority of pretraining data can
be removed while retaining performance.
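The core recipe described in the abstract is simple: score each pretraining document with a quality estimator such as reference-model perplexity, rank the corpus by that score, and train on only a fraction of it. The sketch below illustrates that idea under stated assumptions; the choice of GPT-2 as the reference model, the `keep_fraction` value, and keeping the lowest-perplexity region are illustrative choices, not the paper's exact setup.

```python
# Minimal sketch of perplexity-based data pruning. A small reference LM
# (here GPT-2 via Hugging Face transformers) scores each document; the
# paper's actual reference model, scoring windows, and which region of the
# perplexity distribution to keep are assumptions here, not its recipe.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def score_perplexity(docs, model_name="gpt2", max_length=1024):
    """Return one perplexity score per document under a reference LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    scores = []
    with torch.no_grad():
        for text in docs:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length)
            # The LM loss is mean token-level cross-entropy; exp() gives perplexity.
            loss = model(**enc, labels=enc["input_ids"]).loss
            scores.append(math.exp(loss.item()))
    return scores

def prune_by_perplexity(docs, keep_fraction=0.3):
    """Rank documents by reference-model perplexity and keep a fixed fraction.

    Keeping the lowest-perplexity documents is one natural choice; the paper
    compares several selection regions, so treat this as illustrative only.
    """
    scores = score_perplexity(docs)
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [doc for _, doc in ranked[:n_keep]]

if __name__ == "__main__":
    corpus = ["A clean, well-formed paragraph of English text.",
              "qqq ### 000 !!! noisy scraped boilerplate ###"]
    print(prune_by_perplexity(corpus, keep_fraction=0.5))
```

The same ranking loop can host the more expensive estimators the paper compares (Error L2-Norm, memorization) by swapping out the scoring function while keeping the prune step unchanged.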
Related papers
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset curation: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z)
- Downstream Datasets Make Surprisingly Good Pretraining Corpora [39.77171117174906]
This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning.
In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus.
Our results suggest that the performance gains from pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of massive amounts of external pretraining data.
arXiv Detail & Related papers (2022-09-28T19:28:43Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the newly constructed dataset can significantly improve the ability of the learned FER model.
Because the enlarged dataset also raises the training cost, we further propose a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.