SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models
- URL: http://arxiv.org/abs/2602.07616v1
- Date: Sat, 07 Feb 2026 16:51:16 GMT
- Title: SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models
- Authors: Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan
- Abstract summary: We present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts by re-routing tokens from secondary experts to their most similar primary counterparts. SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment.
- Score: 19.56443760368644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. Code implementation of SERE can be found at https://github.com/JL-Cheng/SERE.
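The abstract's core idea, collapsing tokens' secondary-expert assignments onto similar primary experts so a batch activates fewer unique experts, can be sketched as follows. This is an illustrative sketch only: the function name, the expert similarity matrix, and the re-routing rule are assumptions, not SERE's published algorithm (see the linked repository for the actual implementation).

```python
import numpy as np

def reroute(topk_ids: np.ndarray, sim: np.ndarray, keep: int = 1) -> np.ndarray:
    """Illustrative batch-level re-routing (names and logic are assumptions,
    not the paper's exact algorithm).

    topk_ids: (batch, k) expert indices from the router; the first `keep`
              columns hold each token's primary expert(s).
    sim:      (num_experts, num_experts) expert similarity matrix.
    """
    # Experts chosen as primary by any token in the batch must stay active.
    primary = np.unique(topk_ids[:, :keep])
    out = topk_ids.copy()
    for col in range(keep, topk_ids.shape[1]):      # each secondary slot
        for row in range(topk_ids.shape[0]):        # each token in the batch
            e = topk_ids[row, col]
            if e not in primary:
                # Re-route the token to the most similar primary expert,
                # so expert e need not be activated for this batch.
                out[row, col] = primary[np.argmax(sim[e, primary])]
    return out
```

With two tokens whose secondary experts (2 and 3) are each most similar to a different primary expert (0 and 1), the batch's active-expert set shrinks from four experts to two.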
Related papers
- TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration [3.510563137261977]
Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. We identify a fundamental mismatch between MoE architectures and diffusion-based decoding. We propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts.
arXiv Detail & Related papers (2026-02-09T09:05:46Z)
- Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs [22.399470395813577]
Dynamic Expert Sharing (DES) is a novel technique that shifts MoE optimization from token-centric pruning to sequence-level coreset selection. DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy.
arXiv Detail & Related papers (2026-01-31T20:01:47Z)
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving the prefilling time by 2.16x and the decoding time by 1.26x.
arXiv Detail & Related papers (2025-11-19T18:48:27Z)
- Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by splitting SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
- Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models [45.691230716687365]
Mixture-of-Experts (MoE) enables efficient scaling of large language models with sparsely activated experts during inference. Many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. We show that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency.
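One simple way to operationalize "local routing consistency" is the fraction of top-k experts shared by consecutive tokens; when it is high, a small cached working set of experts serves long token runs. This definition is an illustrative assumption, not necessarily the paper's exact metric:

```python
def local_routing_consistency(topk_ids, k):
    """Average fraction of top-k experts shared by consecutive tokens.
    An illustrative definition; the paper's exact metric may differ.

    topk_ids: per-token lists of the k expert indices chosen by the router.
    """
    shared = 0
    for prev, cur in zip(topk_ids, topk_ids[1:]):
        shared += len(set(prev) & set(cur))  # experts reused by the next token
    return shared / (k * (len(topk_ids) - 1))
```

A score near 1.0 suggests expert offloading with a small hot cache will rarely miss; a score near 0.0 suggests nearly every token forces a load from slow memory.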
arXiv Detail & Related papers (2025-05-21T22:13:09Z)
- Accelerating MoE Model Inference with Expert Sharding [1.4733737463429546]
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts.
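Tensor sharding of an expert can be illustrated with a standard tensor-parallel split of a two-layer FFN: column-shard the first weight matrix and row-shard the second, so every device holds a slice of every expert and computes a partial output that is summed. This is a generic sketch of the technique; MoEShard's actual partitioning and communication scheme are not described in the summary above:

```python
import numpy as np

def shard_expert(w1, w2, n_shards):
    """Split one expert FFN (h = relu(x @ w1); y = h @ w2) into n_shards:
    column-shard w1 and row-shard w2 along the hidden dimension."""
    return list(zip(np.array_split(w1, n_shards, axis=1),
                    np.array_split(w2, n_shards, axis=0)))

def expert_forward_sharded(x, shards):
    # Each shard computes a partial output over its hidden slice; summing the
    # partials reproduces the full expert exactly (ReLU is elementwise, so it
    # commutes with the hidden-dimension split).
    return sum(np.maximum(x @ w1, 0) @ w2 for w1, w2 in shards)
```

Because every shard participates in every expert's computation, load stays balanced no matter how skewed the router's token-to-expert assignment is.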
arXiv Detail & Related papers (2025-03-11T14:15:01Z)
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of several prominent language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency [4.254906060165999]
We show that even a 2-layer, BERT-based query encoder can still retain 92.5% of the full DE performance on the BEIR benchmark.
We hope that our findings will encourage the community to re-evaluate the trade-offs between method complexity and performance improvements.
arXiv Detail & Related papers (2023-06-05T06:53:55Z)