Efficient Online Data Mixing For Language Model Pre-Training
- URL: http://arxiv.org/abs/2312.02406v2
- Date: Sat, 9 Dec 2023 00:27:18 GMT
- Title: Efficient Online Data Mixing For Language Model Pre-Training
- Authors: Alon Albalak and Liangming Pan and Colin Raffel and William Yang Wang
- Abstract summary: Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
- Score: 101.45242332613944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The data used to pretrain large language models has a decisive impact on a
model's downstream performance, which has led to a large body of work on data
selection methods that aim to automatically determine the most suitable data to
use for pretraining. Existing data selection methods suffer from slow and
computationally expensive processes, a problem amplified by the increasing size
of models and of pretraining datasets. Data mixing, on the other hand, reduces
the complexity of data selection by grouping data points together and
determining sampling probabilities across entire groups. However, data mixing
proportions are typically fixed before training and therefore cannot adapt to
changing training dynamics. To address these limitations, we develop an
efficient algorithm for Online Data Mixing (ODM) that combines elements from
both data selection and data mixing. Based on multi-armed bandit algorithms,
our online approach optimizes the data mixing proportions during training.
Remarkably, our method trains a model that reaches the final perplexity of the
next best method with 19% fewer training iterations, and improves performance
on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible
wall-clock time during pretraining.
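The abstract describes ODM as a multi-armed-bandit approach that adapts data mixing proportions during training. Below is a minimal illustrative sketch of that idea, assuming an EXP3-style update in which each data domain is a bandit arm and the per-batch training loss serves as the reward; the domain names, step size, and training-loop stub are hypothetical assumptions, not the paper's exact formulation.

```python
import math
import random

# Hypothetical domain groups and hyperparameters -- illustrative only.
DOMAINS = ["web", "code", "books", "wiki"]
ETA = 0.1          # bandit step size (assumed)
NUM_STEPS = 1000   # training steps for this toy loop

log_weights = {d: 0.0 for d in DOMAINS}


def mixing_probabilities():
    """Softmax over log-weights gives the current sampling distribution."""
    m = max(log_weights.values())
    exps = {d: math.exp(w - m) for d, w in log_weights.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}


def train_step_on(domain):
    """Stand-in for sampling a batch from `domain` and taking one gradient
    step; returns the batch loss. Replace with real training code."""
    return random.uniform(1.0, 5.0)


for step in range(NUM_STEPS):
    probs = mixing_probabilities()
    # Sample the next batch's domain from the current mixing proportions.
    domain = random.choices(DOMAINS, weights=[probs[d] for d in DOMAINS])[0]
    loss = train_step_on(domain)
    # EXP3-style importance-weighted update: domains whose batches still
    # incur high loss (used as the "reward" here) are up-weighted, so the
    # mixture adapts as training dynamics change.
    log_weights[domain] += ETA * loss / probs[domain]
```

Because each step only updates a handful of scalars on top of the usual forward/backward pass, a scheme like this adds essentially no wall-clock overhead, consistent with the efficiency claim in the abstract.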
Related papers
- Scalable Data Ablation Approximations for Language Models through Modular Training and Merging [27.445079398772904]
We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus.
We find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data.
arXiv Detail & Related papers (2024-10-21T06:03:49Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently.
We derive a simple and tractable algorithm for selecting such batches, which significantly accelerates training beyond individually prioritized data points.
arXiv Detail & Related papers (2024-06-25T16:52:37Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance as a function of the mixture proportions.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Data Augmentation by Selecting Mixed Classes Considering Distance Between Classes [9.690454593095495]
Methods that generate mixed data from multiple samples, such as mixup, contribute significantly to accuracy improvement.
We propose a data augmentation method that calculates the distance between classes based on class probabilities.
We show that the proposed method improves recognition performance on general and long-tailed image recognition datasets.
arXiv Detail & Related papers (2022-09-12T10:10:04Z)