Related papers: Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

URL: http://arxiv.org/abs/2504.06792v1
Date: Wed, 09 Apr 2025 11:34:06 GMT
Title: Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations
Authors: Zican Dong, Han Peng, Peiyu Liu, Wayne Xin Zhao, Dong Wu, Feng Xiao, Zhifeng Wang,
Abstract summary: We investigate domain specialization and expert redundancy in large-scale MoE models. We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts. With only half the experts, our method achieves performance comparable to the full DeepSeek-R1 and $2.99\times$ throughput under the same memory budget.
Score: 48.890534958441016
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and the output magnitudes of the activated experts, while the latter assesses the contribution of each token based on the similarity of its representations before and after the routed experts. Experiments show that, with only half the experts, our method achieves performance comparable to the full DeepSeek-R1 and $2.99\times$ throughput under the same memory budget. Our code is available at https://github.com/RUCAIBox/EASYEP.
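Illustrative sketch (not the authors' code): the two components named in the abstract can be combined into a per-expert score roughly as follows. The exact formulas are assumptions made for illustration: expert importance is taken as gating score times the L2 norm of the expert output, and token contribution as one minus the cosine similarity between the hidden states before and after the routed experts. The function names and the token dictionary layout are hypothetical; the repository linked above holds the actual implementation.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def score_experts(tokens, num_experts):
    """Accumulate a score per expert over a few demonstration tokens.

    Each token is a dict (hypothetical layout) with:
      'h_in'    : hidden state before the routed experts
      'h_out'   : hidden state after the routed experts
      'experts' : {expert_id: (gate_score, expert_output_vector)}
    """
    scores = [0.0] * num_experts
    for tok in tokens:
        # Assumed token contribution: tokens whose representation changes
        # more across the routed experts receive a larger weight.
        weight = 1.0 - cosine_similarity(tok["h_in"], tok["h_out"])
        for expert_id, (gate, output) in tok["experts"].items():
            # Assumed output-aware importance: gate score times output norm.
            scores[expert_id] += weight * gate * l2_norm(output)
    return scores


def select_experts(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, sorted by id."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy demonstration with 4 experts and 2 tokens (made-up numbers).
demo_tokens = [
    {"h_in": [1.0, 0.0], "h_out": [0.0, 1.0],
     "experts": {0: (0.6, [3.0, 4.0]), 1: (0.4, [0.0, 1.0])}},
    {"h_in": [1.0, 0.0], "h_out": [1.0, 0.0],
     "experts": {2: (0.9, [10.0, 0.0]), 3: (0.1, [1.0, 0.0])}},
]
demo_scores = score_experts(demo_tokens, num_experts=4)
kept = select_experts(demo_scores, keep=2)  # retain half of the experts
```

In this toy input the second token is unchanged by its routed experts, so under the assumed weighting it contributes nothing, and experts 2 and 3 are pruned despite the large gate score of expert 2.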

Related papers

Unveiling Super Experts in Mixture-of-Experts Large Language Models [5.36587814108884]
Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). We present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. Our findings confirm that MoE LLMs rely on Super Experts (SEs) to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning.
arXiv Detail & Related papers (2025-07-31T06:35:33Z)
On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating [75.29576838162714]
DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating.
arXiv Detail & Related papers (2025-05-16T04:58:18Z)
Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
arXiv Detail & Related papers (2025-04-16T04:06:15Z)
Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations [86.90549830760513]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. We propose MoE Experts Compression Suite (MC-Suite) to provide a benchmark for estimating expert importance from diverse perspectives. We present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt.
arXiv Detail & Related papers (2025-04-08T00:49:08Z)
Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE. We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z)
Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts). Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z)
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks. We propose M-SMoE, which leverages routing statistics to guide expert merging. Our MC-SMoE achieves up to an 80% reduction in memory and a 20% reduction in FLOPs, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection [18.059899772411033]
We find that naively combining expert object detectors in a similar way to Deep Ensembles can often lead to degraded performance. We identify that the primary cause of this issue is that the predictions of the experts do not match their performance. To address this, when constructing the Mixture of Experts, we propose to combine their predictions in a manner which reflects the individual performance of the experts.
arXiv Detail & Related papers (2023-09-26T14:52:51Z)
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability [3.021134753248103]
Sparsely-gated Mixture of Expert (MoE) layers have been successfully applied for scaling large transformers. In this work, we apply sparse MoE layers to CNNs for computer vision tasks and analyze the resulting effect on model interpretability.
arXiv Detail & Related papers (2022-04-22T09:40:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.