Related papers: MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

URL: http://arxiv.org/abs/2502.00997v3
Date: Mon, 17 Feb 2025 16:51:23 GMT
Title: MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
Authors: Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu,
Abstract summary: This paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routings, and a novel method for merging experts with different architectures.<n>Experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.
Score: 45.20965298945085
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.

Related papers

Fine-Grained Model Merging via Modular Expert Recombination [33.253051407398836]
We propose MERGE, a method that enables component-wise model merging and input-aware, on-demand module recombination at inference.<n> MERGE formulates component-wise merging as a bi-objective optimization problem that balances cross-task performance and storage efficiency.<n>We show that MERGE consistently outperforms strong baselines and generalizes effectively.
arXiv Detail & Related papers (2026-02-06T09:55:56Z)
AutoMerge: Search-Based Model Merging Framework for Effective Model Reuse [8.950520457150178]
Recently, model merging has emerged in the domain of large language models (LLMs) as a training-free approach.<n>No prior work has systematically investigated whether such an approach can be effectively applied to other deep learning models.<n>We present the first systematic study that evaluates five model merging techniques on three distinct model architectures.
arXiv Detail & Related papers (2026-01-30T09:27:01Z)
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe [51.26601054313749]
Recent efforts on Diffusion MoE models have primarily focused on developing more sophisticated routing mechanisms.<n>Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models.<n>We present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks.
arXiv Detail & Related papers (2025-12-01T03:52:31Z)
MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models [61.89384981175277]
We propose a emphheterogeneous textbfMixture-of-Adapters (MoA) approach to integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE)<n> Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency.
arXiv Detail & Related papers (2025-06-06T09:54:19Z)
S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning [17.579948649237497]
We propose Structural Mixture of Residual Experts (S'MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Specifically, S'MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. We prove that S'MoRE improves "structural flexibility" of traditional MoE (or Mixture-of-LoRA) by exponential order.
arXiv Detail & Related papers (2025-04-08T20:54:00Z)
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization [86.8133939108057]
We propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, AdaMMS outperforms previous model merging methods on various vision-language benchmarks.
arXiv Detail & Related papers (2025-03-31T05:13:02Z)
Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models [10.623996218106564]
We introduce a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space. All expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially diminishes parameter count and computational requirements.
arXiv Detail & Related papers (2025-03-29T14:35:34Z)
Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning [3.8813502422318127]
Building mixture-of-experts (MoE) architecture for Low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT)<n>We first conduct qualitative analysis to indicate that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency.<n>Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE)<n>Our method is simple and alleviates memory bottlenecks, as it incurs minimal experts compared to vanilla MoE models.
arXiv Detail & Related papers (2025-01-17T09:27:08Z)
Training-free Heterogeneous Model Merging [40.681362819808136]
We propose an innovative model merging framework designed for heterogeneous models.<n>We show that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging.<n>Our code is publicly available at https://github.com/zju-vipa/training_free_heterogeneous_model_merging.
arXiv Detail & Related papers (2024-12-29T04:49:11Z)
Collective Model Intelligence Requires Compatible Specialization [29.590052023903457]
We show that as models specialize, the similarity in their feature space structure diminishes, hindering their capacity for collective use. We propose a new direction for achieving collective model intelligence through what we call compatible specialization.
arXiv Detail & Related papers (2024-11-04T15:59:16Z)
Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z)
Retraining-Free Merging of Sparse MoE via Hierarchical Clustering [14.858134039539697]
This paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE)<n>HC-SMoE is a task-agnostic expert merging framework for parameter reduction without retraining.<n>We provide theoretical analysis and evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral.
arXiv Detail & Related papers (2024-10-11T07:36:14Z)
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction. SMILE allows for the upscaling of source models into an MoE model without extra data or further training. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL) Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.