Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts
- URL: http://arxiv.org/abs/2601.21690v1
- Date: Thu, 29 Jan 2026 13:22:06 GMT
- Title: Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts
- Authors: Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen,
- Abstract summary: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one.<n>Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyper parameters remains missing.<n>We use $L$-Stability theory to analyze the generalization of the merged model $boldsymbolx_avg$.
- Score: 36.26786113564521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address those two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) \textit{A unified theoretical framework} is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) \textit{Actionable recommendations} are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/Vit family across 20/8 visual classification tasks, involving thousands of finetuning models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.
Related papers
- ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation [34.173549610331385]
Model merging aims to combine multiple task-specific expert models into a single model.<n>Interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation.<n>acem is an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference.
arXiv Detail & Related papers (2026-03-03T12:53:04Z) - Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity [49.809923981964715]
Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task.<n>In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size.<n>We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
arXiv Detail & Related papers (2026-01-31T23:45:50Z) - Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization [0.0]
We study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization.<n>We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity.<n>Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization.<n>We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance.
arXiv Detail & Related papers (2026-01-21T14:22:25Z) - How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer.<n>Within the Fitting Paradigm, we introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n3) to O(n*C_D*C_) via predictive modeling.<n>We extend the principles of $$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z) - Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts [11.888882732753922]
Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input.<n>We build the first unified theoretical framework that derives these practices as optimal posterior approximation and prior regularization from a Bayesian perspective.<n>Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
arXiv Detail & Related papers (2026-01-07T04:45:07Z) - Why Do More Experts Fail? A Theoretical Analysis of Model Merging [51.18155031364046]
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model.<n>Recent model merging methods have shown promising results, but struggle to maintain performance gains as the number of merged models increases.<n>We show that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged.
arXiv Detail & Related papers (2025-05-27T14:10:46Z) - Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - A Unified Virtual Mixture-of-Experts Framework:Enhanced Inference and Hallucination Mitigation in Single-Model System [9.764336669208394]
Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization.<n>However, hallucinations "where models generate non-factual or misleading content" are especially problematic in smaller-scale architectures.<n>We propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model.
arXiv Detail & Related papers (2025-04-01T11:38:01Z) - Theoretical Convergence Guarantees for Variational Autoencoders [2.8167997311962942]
Variational Autoencoders (VAE) are popular generative models used to sample from complex data distributions.<n>This paper aims to bridge that gap by providing non-asymptotic convergence guarantees for VAE trained using both Gradient Descent and Adam algorithms.<n>Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $beta$-VAE and IWAE.
arXiv Detail & Related papers (2024-10-22T07:12:38Z) - Post-mortem on a deep learning contest: a Simpson's paradox and the
complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: where "scale" metrics perform well overall but perform poorly on sub partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z) - Posterior Differential Regularization with f-divergence for Improving
Model Robustness [95.05725916287376]
We focus on methods that regularize the model posterior difference between clean and noisy inputs.
We generalize the posterior differential regularization to the family of $f$-divergences.
Our experiments show that regularizing the posterior differential with $f$-divergence can result in well-improved model robustness.
arXiv Detail & Related papers (2020-10-23T19:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.