MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging
- URL: http://arxiv.org/abs/2601.17858v1
- Date: Sun, 25 Jan 2026 14:31:57 GMT
- Title: MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging
- Authors: Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
- Abstract summary: MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
- Score: 72.00014675808228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy training. To address this, we introduce \textbf{MergeMix}, a novel approach that efficiently determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. By training domain-specific experts on minimal tokens and optimizing their merging weights against downstream benchmarks, MergeMix effectively optimizes the performance of data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman $\rho > 0.9$) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.
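The abstract describes the workflow only at a high level; the following is a minimal sketch of the merging-weights-as-proxy idea, assuming K pre-trained domain-expert checkpoints and a hypothetical `evaluate_on_benchmark` scorer. The Dirichlet random search stands in for whatever weight optimizer the paper actually uses ("learnable model merging" suggests the weights are optimized directly); this is an illustration of the idea, not the paper's implementation.

```python
# Minimal sketch of merging-weights-as-mixture-proxy (illustrative, not the
# paper's code). `expert_state_dicts` holds K domain experts trained on a small
# token budget; `evaluate_on_benchmark` is a hypothetical downstream scorer.
import torch

def merge_experts(expert_state_dicts, weights):
    """Linearly interpolate K expert checkpoints with simplex weights."""
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
    return merged

def search_mixture(expert_state_dicts, evaluate_on_benchmark, n_trials=64, seed=0):
    """Random search over the weight simplex; the best merging weights are then
    read off as the proposed data mixing ratios for the real training run."""
    torch.manual_seed(seed)
    k = len(expert_state_dicts)
    best_w, best_score = None, float("-inf")
    for _ in range(n_trials):
        w = torch.distributions.Dirichlet(torch.ones(k)).sample()  # simplex point
        score = evaluate_on_benchmark(merge_experts(expert_state_dicts, w))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

The key point is that each candidate mixture costs only a parameter-space merge plus one benchmark evaluation, rather than a full training run.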
Related papers
- Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization [38.78268216433473]
We study model merging as an efficient strategy for estimating the performance of different data mixtures. We conduct experiments on 14 multimodal benchmarks, and empirically demonstrate that the proxy models exhibit a high rank correlation with models trained on actual data mixtures.
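Both this paper and MergeMix justify the proxy through rank correlation with models actually trained on the candidate mixtures. A minimal sketch of that check follows; the two score lists are made-up placeholders, not results from either paper.

```python
# Sketch of the rank-consistency check: do merged-model proxy scores order
# candidate mixtures the same way as models actually trained on them?
from scipy.stats import spearmanr

proxy_scores  = [0.41, 0.38, 0.47, 0.52, 0.44]   # merged-model benchmark scores (placeholders)
actual_scores = [0.58, 0.55, 0.63, 0.69, 0.61]   # scores of models trained on each mixture (placeholders)

rho, p_value = spearmanr(proxy_scores, actual_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A rho near 1.0 means the cheap merging proxy can rank mixtures
# without paying for full training on every candidate.
```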
arXiv Detail & Related papers (2026-02-04T16:06:39Z) - Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training [16.022416196267937]
We propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. We show that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost.
arXiv Detail & Related papers (2026-01-31T14:27:46Z) - TREX: Tokenizer Regression for Optimal Data Mixture [10.917621429052183]
Tokenizer Regression for Optimal Data MiXture (TREX) is a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency.
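The abstract describes the pipeline only in outline; below is a hedged sketch of the regress-then-rank idea it suggests. The ridge regressor, the synthetic compression measurements, and the random candidate pool are illustrative stand-ins, not TREX's actual components.

```python
# Sketch of regression-based mixture prediction in the spirit of TREX:
# observe (mixture, compression) pairs from small proxy tokenizer runs,
# fit a regressor, then rank unseen candidate mixtures by predicted compression.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_domains = 5

train_mixtures = rng.dirichlet(np.ones(n_domains), size=40)          # observed mixtures
compression = train_mixtures @ rng.uniform(2.5, 4.5, n_domains)      # stand-in measurements

model = Ridge(alpha=1.0).fit(train_mixtures, compression)

# Score a pool of candidate mixtures and keep the best predicted one.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
best = candidates[np.argmax(model.predict(candidates))]
print("predicted-best mixture:", np.round(best, 3))
```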
arXiv Detail & Related papers (2026-01-20T04:41:09Z) - Merge to Mix: Mixing Datasets via Model Merging [2.990932417718553]
Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. We propose a novel method, \textit{Merge to Mix}, that accelerates composing dataset mixtures through model merging. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
arXiv Detail & Related papers (2025-05-21T22:34:13Z) - CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
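As a rough illustration of the cluster-then-search loop the abstract describes, the sketch below clusters stand-in document embeddings and scores candidate cluster mixtures with a hypothetical proxy evaluation. The embeddings, the choice of k-means with 20 clusters, and the single uniform sampling round are assumptions; CLIMB iteratively refines its candidate pool rather than sampling once.

```python
# Sketch of a clustering-then-search loop in the spirit of CLIMB (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 64))   # stand-in document embeddings
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)

def proxy_score(mixture):
    # Hypothetical: train a small proxy model with data sampled per `mixture`
    # over the clusters and return its benchmark score. Random stand-in here.
    return float(rng.random())

# One round of mixture search over cluster weights.
candidates = rng.dirichlet(np.ones(20), size=16)
best_mix = max(candidates, key=proxy_score)
print("best candidate this round:", np.round(best_mix, 3))
```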
arXiv Detail & Related papers (2025-04-17T17:58:13Z) - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models [24.396525123797073]
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture. We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
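A hedged sketch of the approximation described above: per-domain expert probabilities are combined with the candidate mixture weights to approximate its cross-entropy, and that value feeds a regression fitted on a handful of observed losses. The probability arrays, the noise, and the single-feature linear regression are illustrative stand-ins, not the paper's setup.

```python
# Sketch: approximate a candidate mixture's cross-entropy from per-domain experts,
# then use the approximation as a regression feature for predicting true loss.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_tokens, n_domains = 1_000, 4
# Hypothetical per-expert probabilities of the correct next token on held-out text.
expert_probs = rng.uniform(0.05, 0.9, size=(n_domains, n_tokens))

def approx_cross_entropy(mixture):
    """-log of the mixture-weighted expert probability, averaged over tokens."""
    mix_prob = mixture @ expert_probs          # shape (n_tokens,)
    return float(-np.log(mix_prob).mean())

# A few mixtures with "observed" losses from real proxy runs (synthetic here).
observed_mixtures = rng.dirichlet(np.ones(n_domains), size=8)
observed_losses = np.array([approx_cross_entropy(m) + rng.normal(0, 0.02)
                            for m in observed_mixtures])

features = np.array([[approx_cross_entropy(m)] for m in observed_mixtures])
reg = LinearRegression().fit(features, observed_losses)
```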
arXiv Detail & Related papers (2025-02-21T21:27:48Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data. We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective. We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by $\sim$200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama.
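To make the "function forms" concrete, the sketch below fits an exponential mixing-law-shaped curve, $L(r) = c + k\,\exp(t \cdot r)$, to synthetic (mixture, loss) observations and uses it to score an unseen mixture. The exact law, its nesting with step- and size-scaling laws, and all numbers here are illustrative assumptions rather than the paper's fitted results.

```python
# Sketch: fit a mixing-law-style functional form L(r) = c + k * exp(t . r)
# to a few observed losses, then predict the loss of an unseen mixture.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
n_domains = 3
mixtures = rng.dirichlet(np.ones(n_domains), size=20)          # observed mixtures
true_t = np.array([-0.8, -0.3, -1.2])
losses = 1.5 + 0.9 * np.exp(mixtures @ true_t) + rng.normal(0, 0.01, 20)  # synthetic

def mixing_law(r, c, k, t1, t2, t3):
    return c + k * np.exp(r @ np.array([t1, t2, t3]))

params, _ = curve_fit(mixing_law, mixtures, losses, p0=[1.0, 1.0, 0.0, 0.0, 0.0])

# Predict the loss of a new candidate mixture without training on it.
candidate = np.array([0.5, 0.3, 0.2])
print("predicted loss:", mixing_law(candidate, *params))
```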
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
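A hedged sketch of what online mixing can look like: a multiplicative-weights (bandit-style) update shifts sampling probability toward domains whose batches currently yield higher loss, i.e. where there is more to learn. The reward definition, learning rate, and loss stand-in below are assumptions for illustration, not ODM's exact algorithm.

```python
# Sketch of online data mixing via multiplicative-weights updates over domains.
import numpy as np

rng = np.random.default_rng(0)
n_domains, steps, eta = 5, 1_000, 0.05
log_weights = np.zeros(n_domains)

def train_step_loss(domain):
    # Hypothetical: sample a batch from `domain`, take one optimizer step,
    # and return the batch loss. A decaying random stand-in here.
    return float(2.0 / (domain + 1) + rng.normal(0, 0.05))

for _ in range(steps):
    probs = np.exp(log_weights - log_weights.max())
    probs /= probs.sum()
    domain = rng.choice(n_domains, p=probs)
    loss = train_step_loss(domain)
    # Importance-weighted reward keeps the update unbiased despite sampling.
    log_weights[domain] += eta * (loss / probs[domain])

print("final sampling probabilities:", np.round(probs, 3))
```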
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
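For context, the sketch below shows the vanilla mixup objective that Decoupled Mixup builds on: convexly interpolated inputs and a lambda-weighted two-term cross-entropy. DM's decoupled regularizer itself is not reproduced here; the `model`, `x`, and `y` arguments are generic placeholders.

```python
# Sketch of the standard mixup training loss (the baseline DM improves on).
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]          # interpolate inputs
    logits = model(x_mixed)
    # Weighted sum of cross-entropy against both sets of labels.
    return lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[perm])
```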
arXiv Detail & Related papers (2022-03-21T07:12:18Z)