RegMix: Data Mixture as Regression for Language Model Pre-training
- URL: http://arxiv.org/abs/2407.01492v2
- Date: Thu, 23 Jan 2025 17:35:43 GMT
- Title: RegMix: Data Mixture as Regression for Language Model Pre-training
- Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin,
- Abstract summary: We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task.
RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute.
- Score: 40.45464495981735
- License:
- Abstract: The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, RegMix consistently outperforms human selection in experiments involving models up to 7B models trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources. Our experiments also show that (1) Data mixtures significantly impact performance; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws. Our code is available at https://github.com/sail-sg/regmix.
Related papers
- MixMin: Finding Data Mixtures via Convex Minimization [23.369015146176928]
Machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources.
Finding the optimal data mixture is a challenging and open problem.
We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective.
In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger.
arXiv Detail & Related papers (2025-02-14T19:15:53Z) - RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks [27.247270530020664]
We study the problem of robust data augmentation for regression tasks in the presence of noisy data.
C-Mixup is more selective in which samples to mix based on their label distances for better regression performance.
We propose RC-Mixup, which tightly integrates C-Mixup with multi-round robust training methods for a synergistic effect.
arXiv Detail & Related papers (2024-05-28T08:02:42Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer named Decoupled Mixup (DM)
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z) - MixRL: Data Mixing Augmentation for Regression using Reinforcement
Learning [2.1345682889327837]
Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks.
We show that mixing examples that either have a large data or label distance may have an increasingly-negative effect on model performance.
We propose MixRL, a data augmentation meta learning framework for regression that learns for each example how many nearest neighbors it should be mixed with for the best model performance.
arXiv Detail & Related papers (2021-06-07T07:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.