D4: Improving LLM Pretraining via Document De-Duplication and
Diversification
- URL: http://arxiv.org/abs/2308.12284v1
- Date: Wed, 23 Aug 2023 17:58:14 GMT
- Title: D4: Improving LLM Pretraining via Document De-Duplication and
Diversification
- Authors: Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
- Abstract summary: We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
- Score: 38.84592304799403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over recent years, an increasing amount of compute and data has been poured
into training large language models (LLMs), usually by doing one-pass learning
on as many tokens as possible randomly selected from large-scale web corpora.
While training on ever-larger portions of the internet leads to consistent
performance improvements, the size of these improvements diminishes with scale,
and there has been little work exploring the effect of data selection on
pre-training and downstream performance beyond simple de-duplication methods
such as MinHash. Here, we show that careful data selection (on top of
de-duplicated data) via pre-trained model embeddings can speed up training (20%
efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up
to 2%) at the 6.7B model scale. Furthermore, we show that repeating data
intelligently consistently outperforms baseline training (while repeating
random data performs worse than baseline training). Our results indicate that
clever data selection can significantly improve LLM pre-training, call into
question the common practice of training for a single epoch on as much data as
possible, and demonstrate a path to keep improving our models past the limits
of randomly sampling web data.
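As a rough illustration of what embedding-based selection of this kind can look like, the sketch below clusters de-duplicated documents by their embeddings, greedily drops near-duplicates within each cluster, and removes the most prototypical points to diversify what remains. This is a minimal sketch, not the paper's exact D4 pipeline; the `select_documents` helper, its thresholds, and the random toy embeddings are assumptions for the example.

```python
# Minimal sketch of embedding-based de-duplication + diversification
# (illustration only; not the paper's exact D4 pipeline).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def select_documents(embeddings: np.ndarray,
                     n_clusters: int = 100,
                     dup_threshold: float = 0.95,
                     proto_drop_frac: float = 0.10) -> np.ndarray:
    """Return indices of documents to keep (assumed thresholds)."""
    emb = normalize(embeddings)  # unit-norm rows so dot products are cosine sims
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # 1) De-duplicate: greedily drop documents too similar to one already kept.
        kept = []
        for i in idx:
            if all(emb[i] @ emb[j] < dup_threshold for j in kept):
                kept.append(i)
        kept = np.array(kept)
        # 2) Diversify: drop the most "prototypical" points (closest to the centroid).
        centroid = emb[kept].mean(axis=0)
        dist = 1.0 - emb[kept] @ centroid / np.linalg.norm(centroid)
        n_drop = int(proto_drop_frac * len(kept))
        order = np.argsort(dist)           # ascending: most prototypical first
        keep.extend(kept[order[n_drop:]])  # keep the more diverse remainder
    return np.sort(np.array(keep))


# Toy usage with random vectors standing in for a pre-trained model's embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 64))
selected = select_documents(fake_embeddings, n_clusters=10)
print(f"kept {len(selected)} of 1000 documents")
```

In practice the embeddings would come from an actual pre-trained language model and the thresholds would be tuned against downstream metrics.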
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
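As a hedged illustration of that scale-dependence (not AutoScale's actual procedure), one could record the tuned domain weights at a few small training scales, fit a simple per-domain trend against log scale, and extrapolate to the target budget. The `extrapolate_mixture` helper and all numbers below are invented for the example.

```python
# Hedged illustration of scale-dependent data composition: extrapolate per-domain
# weights from small-scale runs with a log-linear fit (not AutoScale's method).
import numpy as np


def extrapolate_mixture(scales, weights, target_scale):
    """scales: (k,) token budgets; weights: (k, d) tuned domain weights per budget."""
    log_s = np.log(np.asarray(scales, dtype=float))
    W = np.asarray(weights, dtype=float)
    pred = np.empty(W.shape[1])
    for d in range(W.shape[1]):
        slope, intercept = np.polyfit(log_s, W[:, d], deg=1)
        pred[d] = slope * np.log(target_scale) + intercept
    pred = np.clip(pred, 1e-6, None)
    return pred / pred.sum()  # renormalize into a sampling distribution


# Toy example: weights for [web, code, books] tuned at 1B and 10B tokens,
# extrapolated to a 100B-token run. All numbers are made up.
print(extrapolate_mixture(
    scales=[1e9, 1e10],
    weights=[[0.70, 0.20, 0.10],
             [0.60, 0.27, 0.13]],
    target_scale=1e11,
))
```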
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
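The summary does not spell the samplers out, so the following is only a plausible sketch of a density-style sampler: estimate each example's local density in embedding space from its k-nearest-neighbor distances and sample inversely to it, so sparse regions stay covered. The `density_sample` helper and its parameters are assumptions, not the paper's estimator, and Ask-LLM (which scores examples with a prompted model) is not sketched here.

```python
# Plausible sketch of a density-based sampler: keep coverage of sparse regions
# by sampling inversely to local density (illustration only).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def density_sample(embeddings: np.ndarray, n_keep: int, k: int = 10,
                   seed: int = 0) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)    # column 0 is each point itself
    mean_dist = dists[:, 1:].mean(axis=1)   # large mean distance = low density
    probs = mean_dist / mean_dist.sum()     # sample sparse points more often
    rng = np.random.default_rng(seed)
    return rng.choice(len(embeddings), size=n_keep, replace=False, p=probs)


# Toy usage with random embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
subset = density_sample(emb, n_keep=100)
```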
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods tend to be slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
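One way such an online scheme could look, sketched as an illustration only and not necessarily the ODM algorithm itself, is a multiplicative-weights loop that re-weights domain sampling probabilities from recent training losses; the `online_mixing` helper, its `train_step` callback, and the loss-as-reward choice are assumptions.

```python
# Illustrative online data mixing loop: sample a domain, take a training step,
# and nudge that domain's weight with its recent loss
# (multiplicative-weights style; not necessarily the ODM algorithm).
import numpy as np


def online_mixing(domains, train_step, steps=1000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.ones(len(domains))              # unnormalized domain weights
    for _ in range(steps):
        p = w / w.sum()
        d = rng.choice(len(domains), p=p)
        loss = train_step(domains[d])      # one model update on a batch from d
        w[d] *= np.exp(lr * loss)          # treat high loss as "more informative"
    return w / w.sum()


# Toy usage with fake constant losses: the "code" domain pretends to be harder.
fake_loss = {"web": 1.0, "code": 2.0}
print(online_mixing(["web", "code"], lambda d: fake_loss[d], steps=200))
```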
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach that accelerates the training of deep learning models by using only a subset of the data samples.
This subset is selected based on an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
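A minimal sketch of the warm-up-then-prune idea, assuming the per-example warm-up loss as the importance score (the paper's actual criterion may differ):

```python
# Minimal sketch: keep only the highest-importance examples after a warm-up pass.
# The loss-as-importance choice and 10% keep fraction are assumptions.
import numpy as np


def select_by_importance(warmup_scores: np.ndarray, keep_frac: float = 0.10) -> np.ndarray:
    """Return indices of the `keep_frac` most important examples."""
    n_keep = max(1, int(keep_frac * len(warmup_scores)))
    return np.argsort(warmup_scores)[::-1][:n_keep]


# Toy usage: scores recorded for 10,000 examples during warm-up; the remaining
# ~90% of the data would be dropped for the rest of training.
rng = np.random.default_rng(0)
warmup_losses = rng.gamma(shape=2.0, scale=0.5, size=10_000)
train_subset = select_by_importance(warmup_losses, keep_frac=0.10)
```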
arXiv Detail & Related papers (2023-11-25T22:51:01Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of the datasets used for training, and even the order of individual instances within a dataset, can have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Gradient-guided Loss Masking for Neural Machine Translation [27.609155878513334]
In this paper, we explore strategies that dynamically optimize data usage during the training process.
Our algorithm calculates the gradient alignment between the training data and the clean data to mask out data with negative alignment.
Experiments on three WMT language pairs show that our method brings significant improvement over strong baselines.
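A hedged PyTorch sketch of the masking idea described above, applied to a generic classifier rather than an NMT system: per-example gradients are compared with the gradient of a small clean batch, and examples whose gradients point away from it are dropped from the loss. The `masked_step` helper is an illustration, not the paper's exact training recipe.

```python
# Illustrative gradient-guided loss masking (not the paper's exact recipe):
# drop training examples whose gradient has negative dot product with the
# gradient computed on a small, trusted "clean" batch.
import torch
import torch.nn as nn


def masked_step(model, opt, train_x, train_y, clean_x, clean_y):
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    params = [p for p in model.parameters() if p.requires_grad]

    # Reference gradient on the clean batch.
    clean_loss = loss_fn(model(clean_x), clean_y).mean()
    clean_grad = torch.autograd.grad(clean_loss, params)

    # Keep only training examples whose gradient aligns with the clean gradient.
    keep = []
    for i in range(train_x.shape[0]):
        loss_i = loss_fn(model(train_x[i:i + 1]), train_y[i:i + 1]).mean()
        grad_i = torch.autograd.grad(loss_i, params)
        dot = sum((g * c).sum() for g, c in zip(grad_i, clean_grad))
        keep.append(dot.item() > 0)
    mask = torch.tensor(keep, dtype=torch.float32)

    # One optimizer step on the masked loss.
    opt.zero_grad()
    per_example = loss_fn(model(train_x), train_y)
    loss = (per_example * mask).sum() / mask.sum().clamp(min=1.0)
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage with a linear classifier on random data.
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
cx, cy = torch.randn(8, 16), torch.randint(0, 4, (8,))
masked_step(model, opt, x, y, cx, cy)
```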
arXiv Detail & Related papers (2021-02-26T15:41:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.