From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing
- URL: http://arxiv.org/abs/2510.03293v1
- Date: Mon, 29 Sep 2025 16:29:17 GMT
- Title: From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing
- Authors: Rana Shahout, Colin Cai, Yilun Du, Minlan Yu, Michael Mitzenmacher
- Abstract summary: Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts. Conditional routing shifts the burden to inference memory, limiting the number of experts per device. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy.
- Score: 52.01745035243826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden to inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping accuracy changes negligible.
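The adaptive rule the abstract describes (follow the gate when its scores show a clear winner, otherwise balance load over a wider candidate pool) can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the entropy test, threshold, pool size, and function name are all assumptions.

```python
import numpy as np

def laser_style_route(gate_scores, expert_loads, k=2, entropy_threshold=0.9, pool_size=4):
    """Illustrative distribution-adaptive routing for one token.

    gate_scores: gate probabilities over experts for this token.
    expert_loads: current token count assigned to each expert.
    Returns the sorted list of k chosen expert ids.
    """
    probs = gate_scores / gate_scores.sum()
    # Normalized entropy in [0, 1] measures how "flat" the gate distribution is.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))
    if entropy < entropy_threshold:
        # Clear preference: follow the gate's top-k choice.
        chosen = np.argsort(probs)[::-1][:k]
    else:
        # Near-uniform scores: widen the candidate pool, then pick the
        # least-loaded k experts among those candidates.
        pool = np.argsort(probs)[::-1][:pool_size]
        chosen = pool[np.argsort(expert_loads[pool])[:k]]
    return sorted(chosen.tolist())
```

With a peaked distribution the routine behaves like ordinary top-k gating; only when the gate is indifferent does load information enter the decision, which is why accuracy can stay close to the baseline.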
Related papers
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
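The rerouting idea can be sketched as a greedy rebalance. This is a hypothetical helper under simplified assumptions: it moves only token counts between devices, whereas LLEP also migrates the associated expert parameters.

```python
def llep_style_rebalance(device_token_counts, capacity):
    """Greedily move excess tokens from overloaded devices to the least-loaded ones.

    device_token_counts: tokens currently assigned to each device.
    capacity: per-device token budget.
    Returns (new_counts, moves) where moves is a list of (src, dst, n_tokens).
    """
    counts = list(device_token_counts)
    moves = []
    for src in [i for i, c in enumerate(counts) if c > capacity]:
        excess = counts[src] - capacity
        while excess > 0:
            # Send tokens to whichever device currently holds the fewest.
            dst = min(range(len(counts)), key=lambda j: counts[j])
            room = capacity - counts[dst]
            if room <= 0:
                break  # every device is at capacity; nothing more to offload
            n = min(excess, room)
            counts[src] -= n
            counts[dst] += n
            excess -= n
            moves.append((src, dst, n))
    return counts, moves
```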
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
- Improving MoE Compute Efficiency by Composing Weight and Data Sparsity [50.654297246411545]
Mixture-of-Experts layers achieve compute efficiency through weight sparsity. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis.
arXiv Detail & Related papers (2026-01-21T18:53:58Z)
- Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts [32.65737144630759]
Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. We introduce kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
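The retrieval-augmented routing idea can be sketched as a nearest-neighbor lookup over cached hidden states. The names, the cosine-similarity metric, and the majority vote here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def knn_route(query_hidden, memory_hiddens, memory_experts, k=3):
    """Route a token by reusing expert choices made for similar past tokens.

    query_hidden: hidden state of the current token, shape (d,).
    memory_hiddens: cached hidden states, shape (n, d).
    memory_experts: expert id stored for each cached state, shape (n,).
    Returns the expert most frequently chosen for the k nearest neighbors.
    """
    # Cosine similarity between the query and every memorized hidden state.
    q = query_hidden / np.linalg.norm(query_hidden)
    m = memory_hiddens / np.linalg.norm(memory_hiddens, axis=1, keepdims=True)
    sims = m @ q
    neighbors = np.argsort(sims)[::-1][:k]
    # Majority vote over the neighbors' stored expert assignments.
    votes = np.bincount(memory_experts[neighbors])
    return int(votes.argmax())
```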
arXiv Detail & Related papers (2026-01-05T14:16:11Z)
- RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress [16.010076395422264]
We show that out-of-distribution prompts can manipulate the routing strategy, which creates computational bottlenecks on certain devices while forcing others to idle. We propose RepetitionCurse, a low-cost black-box strategy to exploit this vulnerability.
arXiv Detail & Related papers (2025-12-30T05:24:26Z)
- Dr.LLM: Dynamic Layer Routing in LLMs [55.11953638340419]
Dr.LLM is a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average.
arXiv Detail & Related papers (2025-10-14T17:51:26Z)
- ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers [14.831117443453165]
Large language model (LLM) query routers are critical to modern AI platforms. We propose ProxRouter, which applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers.
arXiv Detail & Related papers (2025-10-10T20:28:14Z)
- Load Balancing Mixture of Experts with Similarity Preserving Routers [37.348178220494226]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks. We introduce a novel load balancing loss that preserves token-wise relational structure. Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
arXiv Detail & Related papers (2025-06-16T22:22:59Z)
- Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing [0.6445605125467574]
Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently. MoE models need to be distributed across GPU devices, and thus face critical performance bottlenecks. We propose an optimal expert-to-GPU assignment that minimizes token routing costs and balances token processing across devices.
arXiv Detail & Related papers (2025-02-10T16:34:36Z)
- ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models [43.29533894162248]
LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts.
Previous approaches decompose the expert weights as the pre-trained weights plus delta weights, followed by quantizing the delta weights to reduce the model size.
We introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs.
arXiv Detail & Related papers (2024-06-13T12:27:55Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts)
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
- BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
arXiv Detail & Related papers (2021-03-30T23:08:32Z)
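The linear-assignment formulation behind BASE layers can be illustrated with an off-the-shelf Hungarian-style solver. This is a sketch under stated assumptions: the token count divides evenly by the expert count, `base_style_assign` is a made-up name, and the paper itself uses a more scalable solver than this dense one.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def base_style_assign(gate_scores):
    """Balanced token-to-expert assignment via a linear assignment problem.

    gate_scores: (num_tokens, num_experts) token-expert affinity matrix,
    with num_tokens divisible by num_experts. Returns one expert id per
    token such that every expert receives exactly the same number of tokens.
    """
    num_tokens, num_experts = gate_scores.shape
    slots_per_expert = num_tokens // num_experts
    # Give each expert slots_per_expert "slots" by tiling its column;
    # slot j belongs to expert j % num_experts. Negate to maximize affinity.
    cost = -np.tile(gate_scores, (1, slots_per_expert))
    _, slot_idx = linear_sum_assignment(cost)
    return slot_idx % num_experts
```

Because each expert contributes exactly `slots_per_expert` slots, the optimal matching is perfectly balanced by construction, which is the property the abstract describes.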
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.