Merge to Mix: Mixing Datasets via Model Merging
- URL: http://arxiv.org/abs/2505.16066v1
- Date: Wed, 21 May 2025 22:34:13 GMT
- Title: Merge to Mix: Mixing Datasets via Model Merging
- Authors: Zhixu Silvia Tao, Kasper Vinken, Hao-Wei Yeh, Avi Cooper, Xavier Boix
- Abstract summary: Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. We propose a novel method, $\textit{Merge to Mix}$, that accelerates composing dataset mixtures through model merging. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
- Score: 2.990932417718553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring multiple fine-tuning runs to achieve the desired outcome. We propose a novel method, $\textit{Merge to Mix}$, that accelerates composing dataset mixtures through model merging. Model merging is a recent technique that combines the abilities of multiple individually fine-tuned LMs into a single LM by using a few simple arithmetic operations. Our key insight is that merging models individually fine-tuned on each dataset in a mixture can effectively serve as a surrogate for a model fine-tuned on the entire mixture. Merge to Mix leverages this insight to accelerate selecting dataset mixtures without requiring full fine-tuning on each candidate mixture. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
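The key insight above is concrete enough to sketch. The following is a minimal, hypothetical PyTorch illustration, not the paper's actual implementation: state dicts of models fine-tuned on each individual dataset are averaged (one of the "few simple arithmetic operations" model merging relies on), and the merged model's validation score stands in for the score of a model fine-tuned on the whole mixture. All names and the choice of uniform averaging are assumptions.

```python
import torch

def merge_models(state_dicts):
    """Uniformly average parameters of individually fine-tuned models.

    A minimal merging operator (assumes all entries are float tensors);
    weighted or task-arithmetic variants follow the same pattern.
    """
    return {
        key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

def surrogate_score(model, per_dataset_state_dicts, mixture, evaluate):
    """Score a candidate mixture (a list of dataset indices) via merging.

    Instead of fine-tuning on the mixture, load the merge of its
    per-dataset fine-tuned models and evaluate that. `evaluate` is any
    callable mapping a model to a validation metric.
    """
    merged = merge_models([per_dataset_state_dicts[i] for i in mixture])
    model.load_state_dict(merged)
    return evaluate(model)

# Usage sketch: fine-tune once per dataset, then rank candidate mixtures
# by surrogate score instead of fine-tuning on each candidate mixture.
# best = max(candidates, key=lambda m: surrogate_score(model, sds, m, evaluate))
```

Only one fine-tuning run per individual dataset is needed up front; every candidate mixture afterwards costs a merge plus an evaluation, which is where the claimed acceleration comes from.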
Related papers
- Olmix: A Framework for Data Mixing Throughout LM Development [90.12613780066063]
Olmix is a framework that addresses the problem of data mixing in training language models. Design choices across existing methods lack justification or consensus and overlook practical issues such as data constraints. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update (sketched after this entry).
arXiv Detail & Related papers (2026-02-12T18:16:05Z)
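The mixture-reuse mechanism above admits a short sketch: ratios for untouched domains are kept, only affected domains are rescored, and the result is renormalized. This is a hypothetical reading of the one-line summary, not Olmix's published algorithm; `recompute_ratio` is an assumed stand-in for the framework's domain-scoring step.

```python
def reuse_mixture(old_ratios, affected_domains, recompute_ratio):
    """Reuse existing domain ratios, recomputing only affected domains.

    old_ratios: dict mapping domain -> mixing ratio (sums to 1.0)
    affected_domains: domains touched by the data update
    recompute_ratio: assumed callable returning a fresh raw ratio
    """
    ratios = {
        domain: recompute_ratio(domain) if domain in affected_domains else ratio
        for domain, ratio in old_ratios.items()
    }
    total = sum(ratios.values())
    return {domain: ratio / total for domain, ratio in ratios.items()}
```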
- MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training [54.78779514101305]
MaD-Mix is a principled framework that derives multi-modal data mixtures for VLM training. It accelerates VLM training across diverse benchmarks. In complex tri-modal video-image-text scenarios, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture overhead.
arXiv Detail & Related papers (2026-02-08T03:07:36Z)
- Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization [38.78268216433473]
We study model merging as an efficient strategy for estimating the performance of different data mixtures. We conduct experiments on 14 multimodal benchmarks and empirically demonstrate that the proxy models exhibit a high rank correlation with models trained on actual data mixtures (see the sketch after this entry).
arXiv Detail & Related papers (2026-02-04T16:06:39Z)
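The rank-correlation check described above is straightforward to reproduce in outline: score each candidate mixture once with the cheap merged proxy and once with a model actually trained on the mixture, then compare rankings. A sketch with SciPy; the score arrays below are synthetic placeholders, not results from the paper.

```python
from scipy.stats import spearmanr

# Placeholder scores: proxy_scores[i] from a merged proxy model for
# mixture i, true_scores[i] from a model actually trained on mixture i.
proxy_scores = [0.61, 0.58, 0.70, 0.64, 0.55]
true_scores = [0.63, 0.57, 0.72, 0.66, 0.54]

rho, pvalue = spearmanr(proxy_scores, true_scores)
# rho near 1.0 means the proxy ranks mixtures the way full training does,
# so it can drive mixture selection without training on every candidate.
print(f"Spearman rank correlation: {rho:.3f} (p={pvalue:.3g})")
```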
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training [16.022416196267937]
We propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. We show that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost.
arXiv Detail & Related papers (2026-01-31T14:27:46Z)
- MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy (sketched after this entry). Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z)
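"Learnable model merging" as a performance proxy can be sketched as follows: parameterize per-model merge coefficients with a softmax, build the merged parameters as a convex combination, and optimize the coefficients against a validation loss; the learned weights then double as candidate mixing ratios. This is an assumed reconstruction from the summary, not MergeMix's published procedure, and `val_loss_fn` is a hypothetical differentiable validation loss.

```python
import torch

def learn_merge_weights(state_dicts, val_loss_fn, steps=100, lr=0.1):
    """Learn convex merge weights over per-dataset fine-tuned models."""
    logits = torch.zeros(len(state_dicts), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        weights = torch.softmax(logits, dim=0)  # convex weights, one per model
        merged = {
            key: sum(weights[i] * sd[key] for i, sd in enumerate(state_dicts))
            for key in state_dicts[0]
        }
        loss = val_loss_fn(merged)  # must be differentiable w.r.t. merged params
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Repurpose the learned merge weights as estimated data mixing ratios.
    return torch.softmax(logits, dim=0).detach()
```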
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z)
- Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation [15.47711837051754]
We propose Mixup Model Merge ($M^3$), an innovative approach inspired by the Mixup data augmentation technique. $M^3$ is a simple yet effective model merging method that significantly enhances the performance of the merged model (sketched after this entry).
arXiv Detail & Related papers (2025-02-21T13:01:26Z)
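The Mixup analogy suggests a very small sketch: sample the interpolation coefficient from a Beta distribution, as Mixup does for training inputs, and apply it to the parameters of two fine-tuned models instead. A hypothetical two-model version; the Beta concentration parameter is illustrative.

```python
import torch

def mixup_model_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Merge two fine-tuned models by randomized linear interpolation.

    lam ~ Beta(alpha, alpha), as in Mixup data augmentation, but applied
    to model parameters rather than to training examples.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return {key: lam * state_dict_a[key] + (1.0 - lam) * state_dict_b[key]
            for key in state_dict_a}
```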
- MixMin: Finding Data Mixtures via Convex Minimization [23.369015146176928]
Machine learning pipelines increasingly combine and mix data from diverse and disparate sources. Finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger (see the sketch after this entry).
arXiv Detail & Related papers (2025-02-14T19:15:53Z)
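Once the objective is convex, finding the mixture reduces to constrained minimization over the probability simplex, which is compact to sketch. The loss below is a synthetic convex stand-in, not MixMin's actual downstream objective.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic per-source losses on a downstream objective; in a
# MixMin-style setup these would come from models trained per source.
source_losses = np.array([0.9, 0.4, 0.7, 0.5])

def downstream_loss(w):
    # Stand-in convex objective: expected loss under mixture weights w,
    # plus a small quadratic term so the optimum is not a single vertex.
    return float(source_losses @ w + 0.1 * np.sum(w ** 2))

n = len(source_losses)
result = minimize(
    downstream_loss,
    x0=np.full(n, 1.0 / n),  # start from the uniform mixture
    bounds=[(0.0, 1.0)] * n,  # each mixture weight in [0, 1]
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # simplex
)
print("mixture weights:", np.round(result.x, 3))
```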
- RegMix: Data Mixture as Regression for Language Model Pre-training [40.45464495981735]
We propose RegMix, which automatically identifies a high-performing data mixture by formulating selection as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict the performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute (sketched after this entry).
arXiv Detail & Related papers (2024-07-01T17:31:03Z)
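That recipe can be sketched end to end: sample random mixtures, observe proxy-model performance, fit a regressor from mixture weights to the metric, and search the predicted surface for the best mixture. The sketch below uses scikit-learn with synthetic data, and a linear regressor stands in for whatever the paper actually fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_domains, n_runs = 4, 32

# 1) Sample random mixtures (points on the simplex) and observe the
#    metric from small proxy models -- synthetic stand-ins here.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)
observed = mixtures @ np.array([0.2, 0.8, 0.5, 0.3]) + rng.normal(0, 0.01, n_runs)

# 2) Fit mixture -> performance.
regressor = LinearRegression().fit(mixtures, observed)

# 3) Predict over many unseen candidate mixtures; the best predicted
#    mixture is the one used for the single large-scale training run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best = candidates[np.argmax(regressor.predict(candidates))]
print("predicted-best mixture:", np.round(best, 3))
```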
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance with respect to mixture proportions in functional forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained on 100B tokens from RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- TransformMix: Learning Transformation and Mixing Strategies from Data [20.79680733590554]
We propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data.
We demonstrate the effectiveness of TransformMix on multiple datasets in transfer learning, classification, object detection, and knowledge distillation settings.
arXiv Detail & Related papers (2024-03-19T04:36:41Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- Revisiting Permutation Symmetry for Merging Models between Different Datasets [3.234560001579257]
We investigate the properties of merging models between different datasets.
We find that the accuracy of the merged model decreases more significantly as the datasets diverge more.
We show that condensed datasets created by dataset condensation can be used as substitutes for the original datasets.
arXiv Detail & Related papers (2023-06-09T03:00:34Z)
- Learning with MISELBO: The Mixture Cookbook [62.75516608080322]
We present the first ever mixture of variational approximations for a normalizing flow-based hierarchical variational autoencoder (VAE) with VampPrior and a PixelCNN decoder network.
We explain this cooperative behavior by drawing a novel connection between VI and adaptive importance sampling.
We obtain state-of-the-art results among VAE architectures in terms of negative log-likelihood on the MNIST and FashionMNIST datasets.
arXiv Detail & Related papers (2022-09-30T15:01:35Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM). DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup (see the sketch after this entry).
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
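Since standard mixup is the base operation DM modifies, a minimal sketch of it grounds the entry: inputs and targets are mixed pairwise with a Beta-sampled coefficient, which is what smooths the decision boundary. The decoupled regularizer itself acts on the objective and is not reproduced here.

```python
import torch

def mixup_batch(inputs, targets_onehot, alpha=1.0):
    """Classic mixup: convex combinations of randomly paired examples.

    Decoupled Mixup (DM) changes the loss applied to such mixed samples;
    this sketch shows only the standard mixing step.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))  # random pairing within the batch
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[perm]
    mixed_targets = lam * targets_onehot + (1.0 - lam) * targets_onehot[perm]
    return mixed_inputs, mixed_targets
```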
- Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasets [0.0]
We introduce a model-based clustering method called the Mixed Deep Gaussian Mixture Model (MDGMM).
This architecture is flexible and can be adapted to mixed as well as to continuous or non-continuous data.
Our model provides continuous low-dimensional representations of the data which can be a useful tool to visualize mixed datasets.
arXiv Detail & Related papers (2020-10-13T19:52:46Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.