Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
- URL: http://arxiv.org/abs/2505.16056v1
- Date: Wed, 21 May 2025 22:13:09 GMT
- Title: Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
- Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
- Abstract summary: Mixture-of-Experts (MoE) enables efficient scaling of large language models with sparsely activated experts during inference. Many systems introduce *expert offloading*, which caches a subset of experts in fast memory and leaves the others on slow memory to run on CPU or load on demand. We show that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency.
- Score: 35.617468386609254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .
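The minimal Python sketch below illustrates the idea behind the two metrics on a per-layer routing trace. The exact formulations and evaluation pipeline are defined in the paper and its companion repository (https://github.com/ljcleo/moe-lrc); the function names and the simple coverage-based approximation here are illustrative assumptions, not the authors' implementation.

```python
"""Simplified, SRP/SCH-inspired segment-level routing-consistency metrics.

Assumptions (not from the paper): a routing trace is a list of per-token
lists of activated expert indices for a single MoE layer with top-k routing.
"""
from collections import Counter
from typing import Sequence


def segment_expert_coverage(segment: Sequence[Sequence[int]], budget: int) -> float:
    """Best fraction of expert activations in `segment` covered by a fixed
    group of `budget` experts (an SRP-like quantity)."""
    counts = Counter(e for token_experts in segment for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # For a fixed segment, keeping the `budget` most frequently activated
    # experts maximizes the number of covered activations.
    covered = sum(c for _, c in counts.most_common(budget))
    return covered / total


def segment_cache_hit_rate(trace: Sequence[Sequence[int]],
                           segment_len: int, cache_size: int) -> float:
    """Optimal segment-level cache hit rate (an SCH-like quantity): split the
    trace into fixed-length segments and assume the cache holds the best
    `cache_size` experts for each segment."""
    hits = total = 0
    for start in range(0, len(trace), segment_len):
        segment = trace[start:start + segment_len]
        counts = Counter(e for token_experts in segment for e in token_experts)
        cached = {e for e, _ in counts.most_common(cache_size)}
        for token_experts in segment:
            total += len(token_experts)
            hits += sum(1 for e in token_experts if e in cached)
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy trace: 8 tokens, top-2 routing over 8 experts.
    trace = [[0, 1], [0, 1], [0, 2], [1, 2], [5, 6], [5, 6], [5, 7], [6, 7]]
    print(segment_expert_coverage(trace[:4], budget=2))              # 0.75
    print(segment_cache_hit_rate(trace, segment_len=4, cache_size=2))  # 0.75
```

With a trace like this, sweeping `cache_size` from the number of active experts upward is one way to probe the paper's observation that a cache of roughly 2x the active experts balances effectiveness and efficiency.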
Related papers
- Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models [58.54288496296157]
Chain-of-Experts (CoE) is a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each step within a layer.
arXiv Detail & Related papers (2025-06-23T02:15:43Z)
- CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference [33.871080938643566]
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. We propose CMoE, a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation.
arXiv Detail & Related papers (2025-02-06T14:05:30Z)
- Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference [14.57414071160821]
We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We present on-device results demonstrating 2x speedups on mobile devices.
arXiv Detail & Related papers (2024-11-27T18:59:48Z)
- Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning [26.945473092961123]
We propose ConDense-MoE, which condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts.
arXiv Detail & Related papers (2024-11-26T00:56:18Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- RouterRetriever: Routing over a Mixture of Expert Embedding Models [58.987116118425995]
We introduce RouterRetriever, a retrieval model that leverages a mixture of domain-specific experts by using a routing mechanism. RouterRetriever is the first work to demonstrate the advantages of routing over a mixture of domain-specific expert embedding models.
arXiv Detail & Related papers (2024-09-04T13:16:55Z)
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of experts with a smaller change in the routers' l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.