When Less is More: Investigating Data Pruning for Pretraining LLMs at
  Scale
- URL: http://arxiv.org/abs/2309.04564v1
- Date: Fri, 8 Sep 2023 19:34:05 GMT
- Title: When Less is More: Investigating Data Pruning for Pretraining LLMs at
  Scale
- Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh
  Fadaee, Sara Hooker
- Abstract summary: Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
- Score: 12.94829977468838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Large volumes of text data have contributed significantly to the development
of large language models (LLMs) in recent years. This data is typically
acquired by scraping the internet, leading to pretraining datasets comprised of
noisy web text. To date, efforts to prune these datasets down to a higher
quality subset have relied on hand-crafted heuristics encoded as rule-based
filters. In this work, we take a wider view and explore scalable estimates of
data quality that can be used to systematically measure the quality of
pretraining data. We perform a rigorous comparison at scale of the simple data
quality estimator of perplexity, as well as more sophisticated and
computationally intensive estimates of the Error L2-Norm and memorization.
These metrics are used to rank and prune pretraining corpora, and we
subsequently compare LLMs trained on these pruned datasets. Surprisingly, we
find that the simple technique of perplexity outperforms our more
computationally expensive scoring methods. We improve over our no-pruning
baseline while training on as little as 30% of the original training dataset.
Our work sets the foundation for unexplored strategies in automatically
curating high quality corpora and suggests the majority of pretraining data can
be removed while retaining performance.
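
The rank-and-prune step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a toy add-one-smoothed unigram model stands in for the reference language model, and the names `unigram_perplexity`, `prune_by_perplexity`, `keep_fraction`, and `start_fraction` are hypothetical. The abstract does not say which part of the perplexity ranking is kept, so the retained window is left as a parameter.

```python
import math
from collections import Counter


def unigram_perplexity(doc, counts, total, vocab_size):
    """Perplexity of `doc` under an add-one-smoothed unigram model.

    Assumption: this toy model stands in for a real reference LLM.
    """
    tokens = doc.split()
    if not tokens:
        return float("inf")
    # Unseen tokens get count 0 from the Counter, so smoothing keeps log() finite.
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


def prune_by_perplexity(docs, reference_docs, keep_fraction=0.3, start_fraction=0.0):
    """Rank `docs` by perplexity (ascending) and keep a contiguous window.

    `keep_fraction` is the share of documents retained; `start_fraction`
    is where the window begins in the ranking (0.0 = lowest perplexity).
    Both parameters are hypothetical, chosen for illustration only.
    """
    if not docs:
        return []
    counts = Counter(tok for d in reference_docs for tok in d.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra bucket for unseen tokens
    ranked = sorted(
        docs, key=lambda d: unigram_perplexity(d, counts, total, vocab_size)
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    start = min(int(len(ranked) * start_fraction), len(ranked) - n_keep)
    return ranked[start:start + n_keep]


reference = ["the cat sat on the mat", "the dog sat on the rug"]
corpus = [
    "the cat sat on the rug",
    "the dog sat",
    "zxqv 9981 !!!! ####",
    "on the mat qwerty",
]
kept = prune_by_perplexity(corpus, reference, keep_fraction=0.5)
```

In practice the scoring function would be swapped for per-document perplexity from a trained language model; the ranking and windowing logic stays the same.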
 
      
Related papers
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
 We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they could become useful for training.
We show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks.
 arXiv  Detail & Related papers  (2025-06-05T07:12:12Z)
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
 We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
 arXiv  Detail & Related papers  (2025-02-16T11:46:23Z)
- Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
 Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x.
 arXiv  Detail & Related papers  (2025-01-20T21:10:22Z)
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
 We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
 arXiv  Detail & Related papers  (2024-09-09T17:23:29Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
 We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
 arXiv  Detail & Related papers  (2024-02-15T02:27:57Z)
- Data Filtering Networks [67.827994353269]
 We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
 arXiv  Detail & Related papers  (2023-09-29T17:37:29Z)
- D4: Improving LLM Pretraining via Document De-Duplication and
  Diversification [38.84592304799403]
 We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
 arXiv  Detail & Related papers  (2023-08-23T17:58:14Z)
- Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
 Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
 arXiv  Detail & Related papers  (2023-06-08T15:26:52Z)
- Downstream Datasets Make Surprisingly Good Pretraining Corpora [39.77171117174906]
 This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning.
In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus.
Our results hint that performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts.
 arXiv  Detail & Related papers  (2022-09-28T19:28:43Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
 We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger scale dataset.
 arXiv  Detail & Related papers  (2020-11-20T06:16:15Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource
  Tagging Tasks [88.62288327934499]
 We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
 arXiv  Detail & Related papers  (2020-11-03T07:49:15Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
 We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
 arXiv  Detail & Related papers  (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     