CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- URL: http://arxiv.org/abs/2504.13161v1
- Date: Thu, 17 Apr 2025 17:58:13 GMT
- Title: CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
- Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
- Abstract summary: We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
- Score: 63.07024608399447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
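The abstract describes the CLIMB loop only at a high level (embed and cluster the corpus, then iteratively search for mixtures with a small proxy model and a predictor). The sketch below is a minimal, self-contained illustration of that kind of procedure, not the paper's implementation: the text embeddings are random vectors, proxy-model training and evaluation are replaced by a synthetic `proxy_score` function, the predictor is a plain ridge regression, and the Dirichlet-based refinement rule is an assumption chosen for brevity. Only the overall structure follows the abstract.

```python
# Hedged sketch of a CLIMB-style mixture search: cluster documents,
# sample candidate cluster mixtures, score them with a (stand-in) proxy,
# fit a cheap predictor, and concentrate sampling around promising mixtures.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# 1) Embed documents (random vectors stand in for real text embeddings)
#    and group them into semantic clusters.
doc_embeddings = rng.normal(size=(10_000, 64))
n_clusters = 20
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

def proxy_score(weights: np.ndarray) -> float:
    """Placeholder for 'train a small proxy model on this mixture and
    evaluate it'; here just an arbitrary smooth function of the weights."""
    target = np.linspace(0.01, 0.09, n_clusters)
    target /= target.sum()
    return float(-np.sum((weights - target) ** 2) + rng.normal(scale=1e-4))

# 2) Iteratively sample mixtures, score them, fit a predictor, and
#    sharpen the sampling prior around the predictor's current best guess.
history_w, history_s = [], []
sampling_prior = np.ones(n_clusters)               # start from a flat Dirichlet prior
for it in range(5):
    candidates = rng.dirichlet(sampling_prior, size=64)   # candidate mixture weights
    scores = np.array([proxy_score(w) for w in candidates])
    history_w.append(candidates)
    history_s.append(scores)

    # Cheap predictor fit on all (mixture -> score) pairs seen so far.
    predictor = Ridge(alpha=1e-3).fit(np.vstack(history_w), np.concatenate(history_s))

    # Bias the next round toward the mixture the predictor currently prefers.
    probe = rng.dirichlet(np.ones(n_clusters), size=4096)
    best_guess = probe[np.argmax(predictor.predict(probe))]
    sampling_prior = 1.0 + 50.0 * best_guess

best_mixture = np.vstack(history_w)[np.argmax(np.concatenate(history_s))]
print("cluster sizes:", np.bincount(cluster_ids, minlength=n_clusters))
print("best mixture :", np.round(best_mixture, 3))
```

In a real run, `proxy_score` would correspond to training a small model on data drawn according to the candidate weights and evaluating it on target tasks; the clustering, sampling, and predictor fitting are cheap by comparison, which is what makes predictor-guided search attractive.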
Related papers
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations. Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - RegMix: Data Mixture as Regression for Language Model Pre-training [40.45464495981735]
We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task.
RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute.
arXiv Detail & Related papers (2024-07-01T17:31:03Z) - Does your data spark joy? Performance gains from domain upsampling at the end of training [16.572129046599937]
It is expensive to understand the impact of domain-specific datasets on training at large FLOP scales.
We use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks.
This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
arXiv Detail & Related papers (2024-06-05T17:29:15Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z) - A Data Cartography based MixUp for Pre-trained Language Models [47.90235939359225]
MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels (a minimal sketch of this operation appears after this list).
We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.
We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks.
arXiv Detail & Related papers (2022-05-06T17:59:19Z)
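The MixUp operation referenced in the DoubleMix and TDMixUp entries above is simple enough to show directly. The sketch below is generic MixUp on feature vectors with one-hot labels (interpolation coefficient drawn from a Beta distribution); it is not the hidden-space two-step interpolation of DoubleMix or the training-dynamics-based pair selection of TDMixUp, and the function and variable names are illustrative.

```python
# Generic MixUp sketch: interpolate random pairs of examples and their labels.
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Return a MixUp-augmented copy of a batch.

    x: (batch, features) inputs; y: (batch, classes) one-hot labels.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # interpolation coefficient in (0, 1)
    perm = rng.permutation(len(x))        # random pairing of examples
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    y = np.eye(4)[rng.integers(0, 4, size=8)]   # one-hot labels for 4 classes
    x_mix, y_mix = mixup_batch(x, y, alpha=0.4, rng=rng)
    print(x_mix.shape, y_mix.shape)             # (8, 16) (8, 4)
```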
This list is automatically generated from the titles and abstracts of the papers in this site.