Ban&Pick: Ehancing Performance and Efficiency of MoE-LLMs via Smarter Routing
- URL: http://arxiv.org/abs/2509.06346v2
- Date: Tue, 30 Sep 2025 10:29:33 GMT
- Title: Ban&Pick: Ehancing Performance and Efficiency of MoE-LLMs via Smarter Routing
- Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Nanxin Zeng, Chang Xu, Jian Cheng,
- Abstract summary: Ban&Pick is a post-training, plug-and-play strategy for smarter routing.<n>It discovers and reinforces key experts with outsized impact on performance.<n>It delivers free performance gains and inference acceleration without retraining or architectural changes.
- Score: 36.625445642399136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency at inference. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter routing. Pick discovers and reinforces key experts-a small group with outsized impact on performance-leading to notable accuracy gains across domains. Ban further dynamically prunes redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban\&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under the vLLM.
Related papers
- Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing [14.891975420982504]
We propose Large Language Lobotomy (L$3$), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics.<n>L$3$ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced.<n>We evaluate L$3$ on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases average attack success from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free
arXiv Detail & Related papers (2026-02-09T14:42:11Z) - MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference.<n>MoDES significantly enhances inference speed, improving the prefilling time by 2.16$times$ and the decoding time by 1.26$times$.
arXiv Detail & Related papers (2025-11-19T18:48:27Z) - Advancing Expert Specialization for Better MoE [22.88847592702946]
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input.<n>We observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing.<n>We propose a simple yet effective solution that introduces two complementary objectives.
arXiv Detail & Related papers (2025-05-28T13:09:47Z) - Accelerating MoE Model Inference with Expert Sharding [1.4733737463429546]
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead.<n>We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts.
arXiv Detail & Related papers (2025-03-11T14:15:01Z) - Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection [0.3308833414816073]
MoEs are designed for selective expert activation, which forces the activation of all experts and negates MoE's efficiency during the decode phase.
We present Lynx, a system that enables efficient MoE inference through dynamic, batch-aware expert selection.
Our evaluations show that Lynx achieves up to 1.55x reduction in inference latency while maintaining negligible accuracy loss from baseline model.
arXiv Detail & Related papers (2024-11-13T19:18:08Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs)<n>Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss.<n>For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [13.263938935671646]
AdapMoE is an algorithm-system co-design framework for efficient MoE inference.
AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads.
We show AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without degradation accuracy.
arXiv Detail & Related papers (2024-08-19T03:27:15Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the routers l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.