Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
- URL: http://arxiv.org/abs/2602.00747v1
- Date: Sat, 31 Jan 2026 14:27:46 GMT
- Title: Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
- Authors: Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
- Abstract summary: We propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. We show that DeMix breaks the trade-off among sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost.
- Score: 16.022416196267937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off among sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
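The merging-as-proxy step lends itself to a compact illustration. Below is a minimal sketch assuming PyTorch state dicts for the component models; the helper names (`merge_models`, `search_mixtures`, `build_model`, `evaluate`) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of weighted model merging as a data-mixture proxy (illustrative,
# not the DeMix implementation): component models are each trained once on a
# candidate dataset; every sampled mixture is then scored by merging their
# parameters with the mixture ratios instead of training a new proxy model.
import torch


def merge_models(state_dicts, weights):
    """Weighted average of parameter tensors; `weights` are the mixture ratios."""
    assert abs(sum(weights) - 1.0) < 1e-6, "mixture ratios should sum to 1"
    return {
        name: sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }


def search_mixtures(state_dicts, sampled_mixtures, build_model, evaluate):
    """Rank sampled mixtures with merged proxies; no additional training runs."""
    best_mixture, best_score = None, float("-inf")
    for mixture in sampled_mixtures:
        proxy = build_model()
        proxy.load_state_dict(merge_models(state_dicts, mixture))
        score = evaluate(proxy)  # e.g., benchmark accuracy or negative validation loss
        if score > best_score:
            best_mixture, best_score = mixture, score
    return best_mixture, best_score
```

Because each candidate dataset is trained on only once, the number of mixtures scored in `search_mixtures` can grow without adding any training cost.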
Related papers
- Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization [38.78268216433473]
We study model merging as an efficient strategy for estimating the performance of different data mixtures. We conduct experiments on 14 multimodal benchmarks and empirically demonstrate that the proxy models exhibit a high rank correlation with models trained on actual data mixtures.
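A hedged sketch of the rank-correlation check behind that claim (the scores below are placeholder numbers, not reported results):

```python
# Illustrative check: do merged proxy models rank mixtures the same way as models
# trained on the actual mixtures? Spearman's rho measures rank agreement.
from scipy.stats import spearmanr

proxy_scores = [0.412, 0.388, 0.431, 0.405, 0.397]    # merged-proxy benchmark scores
trained_scores = [0.517, 0.493, 0.540, 0.508, 0.501]  # scores of fully trained models

rho, p_value = spearmanr(proxy_scores, trained_scores)
print(f"Spearman rank correlation: {rho:.3f} (p={p_value:.3f})")
```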
arXiv Detail & Related papers (2026-02-04T16:06:39Z)
- MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
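A minimal sketch of learnable merging weights under the same assumptions as the earlier merging example (softmax-parameterized weights optimized against a validation loss; illustrative, not the MergeMix code):

```python
# Illustrative sketch of treating merge weights as learnable parameters: mixture
# logits are mapped to simplex weights with a softmax, merged parameters are built
# differentiably, and the logits are updated to minimize a validation loss.
# `validation_loss` is assumed to run a functional forward pass with the merged
# parameters (e.g. via torch.func.functional_call) so gradients reach the logits.
import torch


def merged_params(state_dicts, logits):
    weights = torch.softmax(logits, dim=0)
    return {
        name: sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }


def learn_merge_weights(state_dicts, validation_loss, steps=200, lr=0.05):
    logits = torch.zeros(len(state_dicts), requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        loss = validation_loss(merged_params(state_dicts, logits))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # the learned simplex weights serve as the predicted data mixing ratios
    return torch.softmax(logits, dim=0).detach()
```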
arXiv Detail & Related papers (2026-01-25T14:31:57Z)
- TREX: Tokenizer Regression for Optimal Data Mixture [10.917621429052183]
Tokenizer Regression for Optimal Data MiXture (TREX) is a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency.
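A small sketch of the regression step described above (the data values and the choice of a Ridge regressor are illustrative assumptions, not the TREX setup):

```python
# Illustrative regression from mixture ratios to measured compression: fit on a few
# proxy-tokenizer runs, then rank many candidate mixtures without training more
# tokenizers. All numbers are placeholders.
import numpy as np
from sklearn.linear_model import Ridge

# mixture ratios over 3 hypothetical sources, and the compression (bytes per token,
# higher is better here) measured with small proxy tokenizers trained on each mixture
X = np.array([[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6],
              [0.4, 0.4, 0.2], [0.2, 0.4, 0.4], [0.34, 0.33, 0.33]])
y = np.array([4.10, 3.85, 3.92, 4.01, 3.88, 3.97])

regressor = Ridge(alpha=1.0).fit(X, y)

candidates = np.random.dirichlet(np.ones(3), size=1000)   # sampled candidate mixtures
best = candidates[np.argmax(regressor.predict(candidates))]
print("predicted-best mixture:", best.round(3))
```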
arXiv Detail & Related papers (2026-01-20T04:41:09Z)
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
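A hedged sketch of the clustering step implied by the summary (the embeddings, cluster count, and k-means choice are illustrative assumptions, not the CLIMB pipeline):

```python
# Illustrative clustering of document embeddings into groups whose proportions then
# serve as the mixture variables to iterate over; random vectors stand in for real
# document embeddings.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.randn(5_000, 64)                     # stand-in document embeddings
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)

counts = np.bincount(labels, minlength=20)
initial_mixture = counts / counts.sum()                     # starting ratios over 20 clusters
print(initial_mixture.round(4))
```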
arXiv Detail & Related papers (2025-04-17T17:58:13Z)
- MixMin: Finding Data Mixtures via Convex Minimization [23.369015146176928]
Machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources. Finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger.
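A sketch of what minimizing a convex mixture objective over the probability simplex can look like (the quadratic surrogate below is an illustrative stand-in, not the MixMin objective):

```python
# Illustrative convex minimization over the probability simplex: SLSQP handles the
# nonnegativity and sum-to-one constraints directly. The quadratic surrogate is a
# placeholder for a convex estimate of downstream loss as a function of the mixture.
import numpy as np
from scipy.optimize import minimize

A = np.array([[2.0, 0.3, 0.1], [0.3, 1.5, 0.2], [0.1, 0.2, 1.8]])  # positive definite
b = np.array([-1.2, -0.9, -1.1])


def surrogate_loss(w):
    return 0.5 * w @ A @ w + b @ w


n = 3
result = minimize(
    surrogate_loss,
    x0=np.full(n, 1.0 / n),
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
print("optimal mixture:", result.x.round(3))
```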
arXiv Detail & Related papers (2025-02-14T19:15:53Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
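A sketch of fitting one simple functional form to observed (mixture, loss) pairs and using it to pick a mixture (the exponential form, the numbers, and the data-generation step are illustrative; the paper's laws and their nested use with step and size scaling are richer):

```python
# Illustrative mixing-law fit: losses observed at a handful of mixtures are fit with
# a simple exponential form L(r) = c + exp(t . r), which is then queried for many
# candidate mixtures. The "observed" losses are synthesized from an assumed law.
import numpy as np
from scipy.optimize import curve_fit


def mixing_law(R, c, t1, t2, t3):
    return c + np.exp(R @ np.array([t1, t2, t3]))


R_obs = np.array([[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6],
                  [0.4, 0.4, 0.2], [0.2, 0.4, 0.4], [0.34, 0.33, 0.33]])
loss_obs = mixing_law(R_obs, 1.8, 0.5, -0.3, 0.2)          # placeholder observations

params, _ = curve_fit(mixing_law, R_obs, loss_obs, p0=[1.5, 0.3, 0.0, 0.3], maxfev=20000)

candidates = np.random.dirichlet(np.ones(3), size=2000)
best = candidates[np.argmin(mixing_law(candidates, *params))]
print("predicted-best mixture:", best.round(3))
```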
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods are slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
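A hedged, bandit-flavoured sketch of updating domain sampling weights online from the loss observed on each sampled domain (an illustrative EXP3-style update with synthetic losses; not necessarily the ODM algorithm itself):

```python
# Illustrative online update of domain mixing weights: domains whose batches show
# higher loss (more left to learn) get sampled more often. The loss function is a
# synthetic stand-in for training on one batch from the chosen domain.
import numpy as np

rng = np.random.default_rng(0)
n_domains, eta = 3, 0.01
log_weights = np.zeros(n_domains)


def train_step_loss(domain):
    # stand-in for one training step on a batch from `domain`, returning its loss
    return rng.normal(loc=[2.0, 2.4, 2.2][domain], scale=0.05)


for step in range(1000):
    probs = np.exp(log_weights - log_weights.max())
    probs /= probs.sum()
    domain = rng.choice(n_domains, p=probs)
    reward = train_step_loss(domain)                       # treat high loss as high reward
    log_weights[domain] += eta * reward / probs[domain]    # importance-weighted update

print("final mixing probabilities:", probs.round(3))
```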
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- A Data Cartography based MixUp for Pre-trained Language Models [47.90235939359225]
MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels.
We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.
We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error with the pre-trained language model BERT in both in-domain and out-of-domain settings across a wide range of NLP tasks.
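The MixUp combination described above reduces to a short interpolation; the sketch below shows plain MixUp only (the training-dynamics-based sample selection of TDMixUp is not modelled, and all shapes are illustrative):

```python
# Plain MixUp: interpolate random pairs of inputs and their one-hot labels with a
# Beta-distributed coefficient. Sample selection via training dynamics is omitted.
import torch


def mixup(x, y_onehot, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed


# usage with illustrative shapes: a batch of 8 sixteen-dimensional inputs, 3 classes
x = torch.randn(8, 16)
y = torch.nn.functional.one_hot(torch.randint(0, 3, (8,)), num_classes=3).float()
x_mixed, y_mixed = mixup(x, y)
```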
arXiv Detail & Related papers (2022-05-06T17:59:19Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)