Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models
- URL: http://arxiv.org/abs/2511.19822v1
- Date: Tue, 25 Nov 2025 01:24:41 GMT
- Title: Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models
- Authors: Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, Jiayin Wang
- Abstract summary: We introduce Mosaic Pruning (MoP) for Sparse Mixture-of-Experts (SMoE) models. MoP builds a functionally comprehensive set of experts through a structured "cluster-then-select" process. Unlike methods that optimize for a single corpus, Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts.
- Score: 18.395286169436794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks such as math reasoning and code generation.
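The abstract describes the pipeline only at a high level, so the following is a minimal, hypothetical sketch of a "cluster-then-select" pass. Everything here is an assumption rather than the paper's specification: each expert is summarized by an activation profile over task domains, clustering is plain k-means, and the Activation Variability Score is stood in for by the variance of a profile across domains.

```python
# Hypothetical sketch of a "cluster-then-select" pruning pass.
# Assumptions (not taken from the paper): each expert is summarized by an
# activation profile over D task domains, clustering is plain k-means, and
# the Activation Variability Score is approximated by the variance of a
# profile across domains.
import numpy as np
from sklearn.cluster import KMeans

def mosaic_prune(profiles: np.ndarray, n_keep: int) -> list[int]:
    """profiles: (num_experts, num_domains) activation statistics.
    Returns the indices of the n_keep experts to retain."""
    # 1) Cluster experts by their cross-domain activation profiles.
    labels = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit_predict(profiles)
    # 2) Select one representative expert per cluster: here, the member
    #    whose activations vary most across domains (a stand-in for the
    #    paper's Activation Variability Score).
    keep = []
    for c in range(n_keep):
        members = np.flatnonzero(labels == c)
        keep.append(int(members[profiles[members].var(axis=1).argmax()]))
    return sorted(keep)

rng = np.random.default_rng(0)
profiles = rng.random((64, 8))            # 64 experts profiled over 8 domains
print(mosaic_prune(profiles, n_keep=16))  # indices of the 16 retained experts
```

In practice the profiles would be collected from forward passes over a multi-domain calibration set. Keeping exactly one expert per cluster is what makes the retained set "mosaic-like": each tile covers a different functional region of the original model.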
Related papers
- Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z)
- Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts [18.18231276284727]
Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely. Recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models.
arXiv Detail & Related papers (2025-09-23T02:07:14Z)
- Mixture of Group Experts for Learning Invariant Representations [25.935653652324532]
Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. We present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation, and propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE); a toy form of such a penalty is sketched after this entry.
arXiv Detail & Related papers (2025-04-12T15:58:02Z)
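As a rough illustration of the MoGE entry above, here is one common form of a group-sparsity penalty: a group-lasso term over fixed expert groups applied to router logits. The group partition, the l2-over-groups form, and the name `group_sparse_penalty` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical group-lasso penalty over router logits. The fixed group
# partition and the l2-over-groups form are illustrative assumptions.
import torch

def group_sparse_penalty(logits: torch.Tensor, group_size: int) -> torch.Tensor:
    """logits: (batch, num_experts) router outputs; num_experts divisible by group_size."""
    b, e = logits.shape
    groups = logits.view(b, e // group_size, group_size)
    # Sum of per-group l2 norms pushes whole groups of logits toward zero.
    return groups.norm(dim=-1).sum(dim=-1).mean()

logits = torch.randn(4, 16, requires_grad=True)   # 4 tokens, 16 experts in groups of 4
reg = group_sparse_penalty(logits, group_size=4)  # added to the task loss with a small coefficient
reg.backward()
```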
- Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models [24.64757529640278]
Cluster-driven Expert Pruning (C-Prune) is a novel two-stage framework for adaptive task-specific compression of large language models. C-Prune operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks.
arXiv Detail & Related papers (2025-04-10T14:46:26Z)
- Retraining-Free Merging of Sparse MoE via Hierarchical Clustering [24.28646376876676]
This paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining; a toy merging pass is sketched after this entry. We provide theoretical analysis and evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral.
arXiv Detail & Related papers (2024-10-11T07:36:14Z)
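For the HC-SMoE entry above, a toy sketch of retraining-free merging: experts are clustered hierarchically by (here) cosine similarity of flattened weights, and each cluster is collapsed by parameter averaging. The similarity features and the averaging merge rule are stand-in assumptions; the paper's concrete choices may differ.

```python
# Toy retraining-free merge: agglomerative clustering of flattened expert
# weights by cosine distance, then parameter averaging inside each cluster.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_experts(weights: np.ndarray, n_merged: int) -> np.ndarray:
    """weights: (num_experts, dim) flattened expert parameters.
    Returns (n_merged, dim) merged experts."""
    tree = linkage(weights, method="average", metric="cosine")
    labels = fcluster(tree, t=n_merged, criterion="maxclust")
    return np.stack([weights[labels == c].mean(axis=0) for c in np.unique(labels)])

rng = np.random.default_rng(0)
print(merge_experts(rng.random((32, 128)), n_merged=8).shape)  # (8, 128)
```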
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures. We introduce an adaptive task-aware pruning technique, UNCURL, to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's $l_2$ norm from the pre-trained model guarantees the preservation of test accuracy; a toy version of this criterion is sketched after this entry. Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
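The provable-pruning entry above states its criterion concretely enough to sketch: score each expert by the l2 norm of the change in its router weights between the pre-trained and fine-tuned checkpoints, and prune the experts with the smallest change. The tensor shapes below are illustrative assumptions.

```python
# Sketch of the router-change criterion: experts whose router weight rows
# moved least during fine-tuning are pruned first. Shapes are illustrative.
import torch

def prune_by_router_shift(w_pre: torch.Tensor, w_ft: torch.Tensor, n_prune: int) -> list[int]:
    """w_pre, w_ft: (num_experts, hidden_dim) router weights before/after fine-tuning."""
    shift = (w_ft - w_pre).norm(dim=1)         # per-expert l2 change
    return shift.argsort()[:n_prune].tolist()  # least-changed experts are pruned

w_pre, w_ft = torch.randn(8, 64), torch.randn(8, 64)
print(prune_by_router_shift(w_pre, w_ft, n_prune=3))
```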
- Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning [68.94230363140771]
Mixture of Cluster-conditional LoRA Experts (MoCLE) is a novel Mixture of Experts architecture designed to activate the task-customized model parameters based on the instruction clusters.
Experiments on InstructBLIP and LLaVA demonstrate the effectiveness of MoCLE.
arXiv Detail & Related papers (2023-12-19T18:11:19Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts; a minimal sketch follows this entry.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
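For the Omni-SMoLA entry, a minimal sketch of softly mixing low-rank (LoRA-style) experts: every expert contributes a low-rank residual update, weighted by a softmax gate rather than hard top-$k$ routing. The module name, dimensions, and gating form are hypothetical assumptions.

```python
# Hypothetical soft mixture of low-rank experts: all experts contribute,
# weighted by a softmax gate, with no hard top-k routing.
import torch
import torch.nn as nn

class SoftLowRankMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int, rank: int):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.02)  # down-projections
        self.B = nn.Parameter(torch.zeros(n_experts, rank, dim))         # up-projections, zero-init

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        g = self.gate(x).softmax(dim=-1)                  # soft gate over all experts
        delta = torch.einsum("bd,edr,erk->bek", x, self.A, self.B)  # per-expert low-rank updates
        return x + torch.einsum("be,bek->bk", g, delta)   # softly mixed residual

x = torch.randn(4, 32)
print(SoftLowRankMoE(dim=32, n_experts=6, rank=4)(x).shape)  # torch.Size([4, 32])
```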