Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking
- URL: http://arxiv.org/abs/2509.25712v1
- Date: Tue, 30 Sep 2025 03:16:24 GMT
- Title: Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking
- Authors: Dengming Zhang, Xiaowen Ma, Zhenliang Ni, Zhenkai Wu, Han Shu, Xin Jiang, Xinghao Chen,
- Abstract summary: Expert Merging is a training-light method that learns a small set of layer-wise coefficients using unlabeled calibration data.<n>To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking.<n>Our method surpasses strong training-free and training-based merging baselines.
- Score: 18.604455802016233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model's hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. The source code is available at https://github.com/Littleor/ExpertMerging.
Related papers
- Training-free LLM Merging for Multi-task Learning [74.93025750111019]
Hi-Merging is a training-free method for unifying different specialized LLMs into a single model.<n>Experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging's ability for multi-task learning.
arXiv Detail & Related papers (2025-06-14T07:21:11Z) - Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration [14.503741632243646]
Multi-task model merging aims to consolidate knowledge from multiple task-specific experts into a unified model.<n>Existing methods approach this by minimizing differences between task-specific experts and the unified model.<n>We propose Layer-wise Optimal Task Vector Merging, a technique that explicitly minimizes feature drift between task-specific experts and the unified model.
arXiv Detail & Related papers (2025-05-29T08:11:31Z) - OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model.<n>We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation.<n>We find that model merging offers a promising way for building improved MLLMs without requiring training data.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - MergeBench: A Benchmark for Merging Domain-Specialized LLMs [19.49737955489798]
MergeBench is an evaluation suite designed to assess model merging at scale.<n>It builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales.<n>We assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency.
arXiv Detail & Related papers (2025-05-16T04:02:55Z) - Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance [2.649901869321331]
We propose a novel Lineage-Regularized Matrix Factorization framework that encodes ancestral ties among Large Language Models.<n>By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods.<n>Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks.
arXiv Detail & Related papers (2025-04-28T14:08:45Z) - LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach [0.0]
LEWIS (Layer Wise Sparsity) is a guided model-merging framework.<n>It guides existing merging methods by preserving essential layer-wise task-specific knowledge.<n>Experiments demonstrate the effectiveness of LEWIS with performance improvements of code instruction-following and math-solving models.
arXiv Detail & Related papers (2025-03-05T20:09:59Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging)
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.