Related papers: Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts

URL: http://arxiv.org/abs/2601.21690v1
Date: Thu, 29 Jan 2026 13:22:06 GMT
Title: Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts
Authors: Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen
Abstract summary: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters remains missing. We use $L_2$-Stability theory to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$.
Score: 36.26786113564521
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address these two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) A unified theoretical framework is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) Actionable recommendations are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/ViT family across 20/8 visual classification tasks, involving thousands of fine-tuned models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.
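
As a rough illustration of what merging "purely in parameter space" can look like, the sketch below averages the parameters of several experts element-wise. It assumes that $\boldsymbol{x}_{avg}$ denotes a uniform average of expert parameters, which the notation suggests but the abstract does not state. The function name `average_experts`, the dictionary layout, and the toy values are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def average_experts(experts: List[Params]) -> Params:
    """Uniformly average experts that share parameter names and shapes."""
    if not experts:
        raise ValueError("need at least one expert")
    merged: Params = {}
    for name in experts[0]:
        # Group the i-th value of this parameter across all experts.
        columns = zip(*(expert[name] for expert in experts))
        merged[name] = [sum(col) / len(experts) for col in columns]
    return merged


# Two toy experts, assumed to be fine-tuned from the same pretrained model.
expert_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
expert_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

x_avg = average_experts([expert_a, expert_b])
print(x_avg)
```

No training data or gradient computation is involved; only the stored parameters are read, which is the property the abstract highlights.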

Related papers

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation [34.173549610331385]
Model merging aims to combine multiple task-specific expert models into a single model. Interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. ACE-Merging is an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference.
arXiv Detail & Related papers (2026-03-03T12:53:04Z)
Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity [49.809923981964715]
Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. In this work, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. We also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal.
arXiv Detail & Related papers (2026-01-31T23:45:50Z)
Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization [0.0]
We study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance.
arXiv Detail & Related papers (2026-01-21T14:22:25Z)
How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n^3) to O(n*C_D*C_) via predictive modeling. We extend the principles of $\mu$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z)
Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts [11.888882732753922]
Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. We build the first unified theoretical framework that derives these practices as optimal posterior approximation and prior regularization from a Bayesian perspective. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
arXiv Detail & Related papers (2026-01-07T04:45:07Z)
Why Do More Experts Fail? A Theoretical Analysis of Model Merging [51.18155031364046]
Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Recent model merging methods have shown promising results, but struggle to maintain performance gains as the number of merged models increases. We show that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged.
arXiv Detail & Related papers (2025-05-27T14:10:46Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System [9.764336669208394]
Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization. However, hallucinations, where models generate non-factual or misleading content, are especially problematic in smaller-scale architectures. We propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model.
arXiv Detail & Related papers (2025-04-01T11:38:01Z)
Theoretical Convergence Guarantees for Variational Autoencoders [2.8167997311962942]
Variational Autoencoders (VAE) are popular generative models used to sample from complex data distributions. This paper aims to bridge that gap by providing non-asymptotic convergence guarantees for VAE trained using both Gradient Descent and Adam algorithms. Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $\beta$-VAE and IWAE.
arXiv Detail & Related papers (2024-10-22T07:12:38Z)
Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models. We identify what amounts to a Simpson's paradox, where "scale" metrics perform well overall but perform poorly on subpartitions of the data. We present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
Posterior Differential Regularization with f-divergence for Improving Model Robustness [95.05725916287376]
We focus on methods that regularize the model posterior difference between clean and noisy inputs. We generalize the posterior differential regularization to the family of $f$-divergences. Our experiments show that regularizing the posterior differential with $f$-divergence can result in well-improved model robustness.
arXiv Detail & Related papers (2020-10-23T19:58:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.