D4: Improving LLM Pretraining via Document De-Duplication and
Diversification
- URL: http://arxiv.org/abs/2308.12284v1
- Date: Wed, 23 Aug 2023 17:58:14 GMT
- Title: D4: Improving LLM Pretraining via Document De-Duplication and
Diversification
- Authors: Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos
- Abstract summary: We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
- Score: 38.84592304799403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over recent years, an increasing amount of compute and data has been poured
into training large language models (LLMs), usually by doing one-pass learning
on as many tokens as possible randomly selected from large-scale web corpora.
While training on ever-larger portions of the internet leads to consistent
performance improvements, the size of these improvements diminishes with scale,
and there has been little work exploring the effect of data selection on
pre-training and downstream performance beyond simple de-duplication methods
such as MinHash. Here, we show that careful data selection (on top of
de-duplicated data) via pre-trained model embeddings can speed up training (20%
efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up
to 2%) at the 6.7B model scale. Furthermore, we show that repeating data
intelligently consistently outperforms baseline training (while repeating
random data performs worse than baseline training). Our results indicate that
clever data selection can significantly improve LLM pre-training, call into
question the common practice of training for a single epoch on as much data as
possible, and demonstrate a path to keep improving our models past the limits
of randomly sampling web data.
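As a rough illustration of what embedding-based selection of this kind can look like, the sketch below clusters de-duplicated documents by their embeddings, greedily drops near-duplicates within each cluster, and removes the most prototypical points to diversify what remains. This is a minimal sketch, not the paper's exact D4 pipeline; the `select_documents` helper, its thresholds, and the random toy embeddings are assumptions for the example.

```python
# Minimal sketch of embedding-based de-duplication + diversification
# (illustration only; not the paper's exact D4 pipeline).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def select_documents(embeddings: np.ndarray,
                     n_clusters: int = 100,
                     dup_threshold: float = 0.95,
                     proto_drop_frac: float = 0.10) -> np.ndarray:
    """Return indices of documents to keep (assumed thresholds)."""
    emb = normalize(embeddings)  # unit-norm rows so dot products are cosine sims
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # 1) De-duplicate: greedily drop documents too similar to one already kept.
        kept = []
        for i in idx:
            if all(emb[i] @ emb[j] < dup_threshold for j in kept):
                kept.append(i)
        kept = np.array(kept)
        # 2) Diversify: drop the most "prototypical" points (closest to the centroid).
        centroid = emb[kept].mean(axis=0)
        dist = 1.0 - emb[kept] @ centroid / np.linalg.norm(centroid)
        n_drop = int(proto_drop_frac * len(kept))
        order = np.argsort(dist)           # ascending: most prototypical first
        keep.extend(kept[order[n_drop:]])  # keep the more diverse remainder
    return np.sort(np.array(keep))


# Toy usage with random vectors standing in for a pre-trained model's embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 64))
selected = select_documents(fake_embeddings, n_clusters=10)
print(f"kept {len(selected)} of 1000 documents")
```

In practice the embeddings would come from an actual pre-trained language model and the thresholds would be tuned against downstream metrics.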
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
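As a hedged illustration of that scale-dependence (not AutoScale's actual procedure), one could record the tuned domain weights at a few small training scales, fit a simple per-domain trend against log scale, and extrapolate to the target budget. The `extrapolate_mixture` helper and all numbers below are invented for the example.

```python
# Hedged illustration of scale-dependent data composition: extrapolate per-domain
# weights from small-scale runs with a log-linear fit (not AutoScale's method).
import numpy as np


def extrapolate_mixture(scales, weights, target_scale):
    """scales: (k,) token budgets; weights: (k, d) tuned domain weights per budget."""
    log_s = np.log(np.asarray(scales, dtype=float))
    W = np.asarray(weights, dtype=float)
    pred = np.empty(W.shape[1])
    for d in range(W.shape[1]):
        slope, intercept = np.polyfit(log_s, W[:, d], deg=1)
        pred[d] = slope * np.log(target_scale) + intercept
    pred = np.clip(pred, 1e-6, None)
    return pred / pred.sum()  # renormalize into a sampling distribution


# Toy example: weights for [web, code, books] tuned at 1B and 10B tokens,
# extrapolated to a 100B-token run. All numbers are made up.
print(extrapolate_mixture(
    scales=[1e9, 1e10],
    weights=[[0.70, 0.20, 0.10],
             [0.60, 0.27, 0.13]],
    target_scale=1e11,
))
```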
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
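The summary does not spell the samplers out, so the following is only a plausible sketch of a density-style sampler: estimate each example's local density in embedding space from its k-nearest-neighbor distances and sample inversely to it, so sparse regions stay covered. The `density_sample` helper and its parameters are assumptions, not the paper's estimator, and Ask-LLM (which scores examples with a prompted model) is not sketched here.

```python
# Plausible sketch of a density-based sampler: keep coverage of sparse regions
# by sampling inversely to local density (illustration only).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def density_sample(embeddings: np.ndarray, n_keep: int, k: int = 10,
                   seed: int = 0) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)    # column 0 is each point itself
    mean_dist = dists[:, 1:].mean(axis=1)   # large mean distance = low density
    probs = mean_dist / mean_dist.sum()     # sample sparse points more often
    rng = np.random.default_rng(seed)
    return rng.choice(len(embeddings), size=n_keep, replace=False, p=probs)


# Toy usage with random embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
subset = density_sample(emb, n_keep=100)
```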
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods tend to be slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
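One way such an online scheme could look, sketched as an illustration only and not necessarily the ODM algorithm itself, is a multiplicative-weights loop that re-weights domain sampling probabilities from recent training losses; the `online_mixing` helper, its `train_step` callback, and the loss-as-reward choice are assumptions.

```python
# Illustrative online data mixing loop: sample a domain, take a training step,
# and nudge that domain's weight with its recent loss
# (multiplicative-weights style; not necessarily the ODM algorithm).
import numpy as np


def online_mixing(domains, train_step, steps=1000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.ones(len(domains))              # unnormalized domain weights
    for _ in range(steps):
        p = w / w.sum()
        d = rng.choice(len(domains), p=p)
        loss = train_step(domains[d])      # one model update on a batch from d
        w[d] *= np.exp(lr * loss)          # treat high loss as "more informative"
    return w / w.sum()


# Toy usage with fake constant losses: the "code" domain pretends to be harder.
fake_loss = {"web": 1.0, "code": 2.0}
print(online_mixing(["web", "code"], lambda d: fake_loss[d], steps=200))
```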
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach that accelerates the training of deep learning models by using only a subset of the data samples.
This subset is selected based on an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
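A minimal sketch of the warm-up-then-prune idea, assuming the per-example warm-up loss as the importance score (the paper's actual criterion may differ):

```python
# Minimal sketch: keep only the highest-importance examples after a warm-up pass.
# The loss-as-importance choice and 10% keep fraction are assumptions.
import numpy as np


def select_by_importance(warmup_scores: np.ndarray, keep_frac: float = 0.10) -> np.ndarray:
    """Return indices of the `keep_frac` most important examples."""
    n_keep = max(1, int(keep_frac * len(warmup_scores)))
    return np.argsort(warmup_scores)[::-1][:n_keep]


# Toy usage: scores recorded for 10,000 examples during warm-up; the remaining
# ~90% of the data would be dropped for the rest of training.
rng = np.random.default_rng(0)
warmup_losses = rng.gamma(shape=2.0, scale=0.5, size=10_000)
train_subset = select_by_importance(warmup_losses, keep_frac=0.10)
```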
arXiv Detail & Related papers (2023-11-25T22:51:01Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of the datasets used for training, and even the order of individual instances within a dataset, can have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Gradient-guided Loss Masking for Neural Machine Translation [27.609155878513334]
In this paper, we explore strategies that dynamically optimize data usage during the training process.
Our algorithm calculates the gradient alignment between the training data and the clean data to mask out data with negative alignment.
Experiments on three WMT language pairs show that our method brings significant improvement over strong baselines.
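A hedged PyTorch sketch of the masking idea described above, applied to a generic classifier rather than an NMT system: per-example gradients are compared with the gradient of a small clean batch, and examples whose gradients point away from it are dropped from the loss. The `masked_step` helper is an illustration, not the paper's exact training recipe.

```python
# Illustrative gradient-guided loss masking (not the paper's exact recipe):
# drop training examples whose gradient has negative dot product with the
# gradient computed on a small, trusted "clean" batch.
import torch
import torch.nn as nn


def masked_step(model, opt, train_x, train_y, clean_x, clean_y):
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    params = [p for p in model.parameters() if p.requires_grad]

    # Reference gradient on the clean batch.
    clean_loss = loss_fn(model(clean_x), clean_y).mean()
    clean_grad = torch.autograd.grad(clean_loss, params)

    # Keep only training examples whose gradient aligns with the clean gradient.
    keep = []
    for i in range(train_x.shape[0]):
        loss_i = loss_fn(model(train_x[i:i + 1]), train_y[i:i + 1]).mean()
        grad_i = torch.autograd.grad(loss_i, params)
        dot = sum((g * c).sum() for g, c in zip(grad_i, clean_grad))
        keep.append(dot.item() > 0)
    mask = torch.tensor(keep, dtype=torch.float32)

    # One optimizer step on the masked loss.
    opt.zero_grad()
    per_example = loss_fn(model(train_x), train_y)
    loss = (per_example * mask).sum() / mask.sum().clamp(min=1.0)
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage with a linear classifier on random data.
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
cx, cy = torch.randn(8, 16), torch.randint(0, 4, (8,))
masked_step(model, opt, x, y, cx, cy)
```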
arXiv Detail & Related papers (2021-02-26T15:41:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.