Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
- URL: http://arxiv.org/abs/2602.04937v1
- Date: Wed, 04 Feb 2026 16:06:39 GMT
- Title: Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
- Authors: Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci,
- Abstract summary: We study model merging as an efficient strategy for estimating the performance of different data mixtures. We conduct experiments on 14 multimodal benchmarks, and empirically demonstrate that the proxy models exhibit a high rank correlation with models trained on actual data mixtures.
- Score: 38.78268216433473
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
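The recipe described above, training one expert per domain and scoring weighted parameter-space combinations as proxies for the corresponding data mixtures, can be sketched in a few lines. The snippet below is a minimal illustration assuming PyTorch-style state dicts for experts fine-tuned from the same base model; the function names and the evaluation loop are hypothetical placeholders, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch only: linearly merge domain experts with mixture weights
# and use the merged model as a cheap proxy for training on that mixture.
import torch

def merge_experts(expert_state_dicts, mixture_weights):
    """Weighted parameter-space combination of K domain experts.

    expert_state_dicts: list of K state dicts with identical keys/shapes,
        each fine-tuned on one domain from the same base model.
    mixture_weights: K non-negative weights summing to 1, standing in for
        the data mixture proportions being evaluated.
    """
    assert len(expert_state_dicts) == len(mixture_weights)
    merged = {}
    for key in expert_state_dicts[0]:
        merged[key] = sum(
            w * sd[key].float()
            for w, sd in zip(mixture_weights, expert_state_dicts)
        )
    return merged

# Hypothetical search loop: each candidate mixture costs one merge plus one
# evaluation, so the K expert training runs are paid only once.
# for weights in candidate_mixtures:            # e.g. points on the simplex
#     proxy_model.load_state_dict(merge_experts(expert_state_dicts, weights))
#     proxy_scores[tuple(weights)] = evaluate(proxy_model, benchmarks)
# best_mixture = max(proxy_scores, key=proxy_scores.get)
```

Because the merged proxies only need to preserve the ranking of candidate mixtures (the paper reports rank correlation with fully trained models), the final model can then be trained once on the best-ranked mixture.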
Related papers
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training [16.022416196267937]
We propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios.
We show that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost.
arXiv Detail & Related papers (2026-01-31T14:27:46Z) - MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy.
Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z) - ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization [11.433087692377779]
We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters.
We demonstrate consistently strong results relative to a wide range of baselines, resulting in speed-ups of over 500%.
In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs worth over 13,000 GPU hours.
arXiv Detail & Related papers (2025-08-15T15:53:09Z) - MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning [37.71233459623324]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs).
Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks.
We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation.
arXiv Detail & Related papers (2025-05-30T17:59:38Z) - Merge to Mix: Mixing Datasets via Model Merging [2.990932417718553]
Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks.
We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging.
Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
arXiv Detail & Related papers (2025-05-21T22:34:13Z) - CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting.
We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance with respect to mixture proportions in functional form.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Optimizing Server-side Aggregation For Robust Federated Learning via Subspace Training [80.03567604524268]
Non-IID data distribution across clients and poisoning attacks are two main challenges in real-world federated learning systems.
We propose SmartFL, a generic approach that optimizes the server-side aggregation process.
We provide theoretical analyses of the convergence and generalization capacity for SmartFL.
arXiv Detail & Related papers (2022-11-10T13:20:56Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup; a minimal sketch of the standard mixup baseline appears after this list.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
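As a point of reference for the Decoupled Mixup entry above, the sketch below shows the standard mixup objective that DM builds on, assuming a PyTorch classification setup; it is not the paper's decoupled regularizer, only the vanilla baseline it modifies.

```python
# Minimal sketch of standard mixup (inputs and labels interpolated with the
# same coefficient); hypothetical training-step helper, PyTorch assumed.
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mixed = lam * x + (1.0 - lam) * x[perm]      # smooth the input space
    logits = model(x_mixed)
    # the two label sets share the same mixing coefficient lam
    return lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[perm])
```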