RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress
- URL: http://arxiv.org/abs/2512.23995v1
- Date: Tue, 30 Dec 2025 05:24:26 GMT
- Authors: Ruixuan Huang, Qingyue Wang, Hantao Huang, Yudong Gao, Dong Chen, Shuai Wang, Wei Wang
- Abstract summary: We show that out-of-distribution prompts can manipulate the routing strategy, which creates computational bottlenecks on certain devices while forcing others to idle. We propose RepetitionCurse, a low-cost black-box strategy to exploit this vulnerability.
- Score: 16.010076395422264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts architectures have become the standard for scaling large language models due to their superior parameter efficiency. To accommodate the growing number of experts in practice, modern inference systems commonly adopt expert parallelism to distribute experts across devices. However, the absence of explicit load balancing constraints during inference allows adversarial inputs to trigger severe routing concentration. We demonstrate that out-of-distribution prompts can manipulate the routing strategy such that all tokens are consistently routed to the same set of top-$k$ experts, which creates computational bottlenecks on certain devices while forcing others to idle. This converts an efficiency mechanism into a denial-of-service attack vector, leading to violations of service-level agreements for time to first token. We propose RepetitionCurse, a low-cost black-box strategy to exploit this vulnerability. By identifying a universal flaw in MoE router behavior, RepetitionCurse constructs adversarial prompts using simple repetitive token patterns in a model-agnostic manner. On widely deployed MoE models like Mixtral-8x7B, our method increases end-to-end inference latency by 3.063x, degrading service availability significantly.
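The authors' code is not shown on this page. As a rough, self-contained illustration of the routing concentration the abstract describes, the Python sketch below contrasts a diverse prompt, whose top-$k$ choices spread across experts, with a repetitive one whose identical tokens keep selecting the same two experts; the simulated assignments and the `max_load_ratio` helper are assumptions standing in for real router outputs, not the paper's measurements.

```python
import numpy as np

def max_load_ratio(topk_assignments: np.ndarray, num_experts: int) -> float:
    """Load of the hottest expert relative to a perfectly balanced router:
    1.0 means balanced; num_experts / k is total concentration on k experts."""
    counts = np.bincount(topk_assignments.ravel(), minlength=num_experts)
    share = counts / counts.sum()
    return float(share.max() * num_experts)

rng = np.random.default_rng(0)
E, K, T = 8, 2, 512  # Mixtral-8x7B routes each token to 2 of 8 experts

benign = rng.integers(0, E, size=(T, K))           # diverse prompt: spread-out picks
adversarial = np.tile(np.array([[3, 5]]), (T, 1))  # repeated token: same two experts

print(max_load_ratio(benign, E))       # ~1.0, near-balanced load
print(max_load_ratio(adversarial, E))  # 4.0: two experts absorb all routing work
```

Under expert parallelism, a ratio of 4.0 here means the devices hosting experts 3 and 5 do all the work while the rest idle, which is exactly the time-to-first-token bottleneck the abstract reports.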
Related papers
- Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z)
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
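The summary names the idea but not the mechanics. Below is a minimal sketch of one plausible reading, using a hypothetical `least_loaded_reroute` helper: tokens beyond a per-expert capacity are greedily reassigned to whichever expert is currently lightest. Real LLEP also migrates the associated expert parameters across devices, which this toy omits.

```python
import heapq

def least_loaded_reroute(assignments: list[int], num_experts: int,
                         capacity: int) -> list[int]:
    """Keep each token on its chosen expert up to `capacity`, then send the
    overflow to the currently least-loaded expert (tracked with a min-heap)."""
    loads = [0] * num_experts
    out: list[int] = []
    overflow: list[int] = []                 # positions of tokens that did not fit
    for i, e in enumerate(assignments):
        if loads[e] < capacity:
            loads[e] += 1
            out.append(e)
        else:
            overflow.append(i)
            out.append(-1)                   # placeholder, filled below
    heap = [(loads[e], e) for e in range(num_experts)]
    heapq.heapify(heap)
    for i in overflow:
        load, e = heapq.heappop(heap)        # least-loaded expert right now
        out[i] = e
        heapq.heappush(heap, (load + 1, e))
    return out

# All five tokens want expert 0; two stay, the rest spill to idle experts.
print(least_loaded_reroute([0, 0, 0, 0, 0], num_experts=4, capacity=2))
```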
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
- Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts [32.65737144630759]
Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. We introduce kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
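The abstract sketches the retrieval idea only at a high level; here is a toy rendering under stated assumptions (the `KNNRouter` class and cosine-similarity memory are illustrative, not the paper's design): store past hidden states with the expert that served them, then route new tokens by majority vote over their nearest stored neighbors.

```python
import numpy as np

class KNNRouter:
    """Toy retrieval-augmented router: reuse expert choices made for the
    most similar previously seen hidden states."""
    def __init__(self, k_neighbors: int = 5):
        self.keys: list[np.ndarray] = []   # unit-normalized hidden states
        self.experts: list[int] = []       # expert that served each key
        self.k = k_neighbors

    def add(self, hidden: np.ndarray, expert: int) -> None:
        self.keys.append(hidden / np.linalg.norm(hidden))
        self.experts.append(expert)

    def route(self, hidden: np.ndarray) -> int:
        q = hidden / np.linalg.norm(hidden)
        sims = np.stack(self.keys) @ q                 # cosine similarity
        nearest = np.argsort(-sims)[: self.k]
        votes = [self.experts[i] for i in nearest]
        return max(set(votes), key=votes.count)        # majority wins
```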
arXiv Detail & Related papers (2026-01-05T14:16:11Z)
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
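For context on the stage being sparsified, the sketch below shows plain greedy verification: the target model scores all draft positions in one batched forward pass (the pass whose attention/FFN/MoE cost the paper reduces), and the longest agreeing prefix is accepted. The sparsification itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def verify_greedy(draft_tokens: list[int], target_logits: np.ndarray) -> list[int]:
    """Accept draft tokens while they match the target model's own greedy
    choice; on the first mismatch, emit the correction and stop.
    target_logits has one row of vocabulary logits per draft position."""
    accepted: list[int] = []
    for tok, logits in zip(draft_tokens, target_logits):
        top = int(logits.argmax())
        accepted.append(top)       # confirms the draft token or corrects it
        if top != tok:
            break                  # everything after a mismatch is discarded
    return accepted
```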
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing [52.01745035243826]
Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts. Conditional routing, however, shifts the burden to inference memory, limiting the number of experts per device. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy.
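The summary does not spell out LASER's algorithm, so the following is only a hedged guess at the flavor of score-aware balancing the title suggests: when several experts score within a small margin of the top-$k$ cutoff, prefer the lighter-loaded ones. The `balanced_topk` function, the margin `tau`, and the tie-breaking rule are all assumptions.

```python
import numpy as np

def balanced_topk(scores: np.ndarray, loads: np.ndarray, k: int,
                  tau: float = 0.05) -> list[int]:
    """Pick k experts for one token: among experts whose router score is
    within tau of the k-th best, prefer lower current load, then higher score."""
    kth_best = np.sort(scores)[-k]
    eligible = np.where(scores >= kth_best - tau)[0]   # near-top experts only
    ranked = sorted(eligible, key=lambda e: (loads[e], -scores[e]))
    return ranked[:k]
```

Because only near-top experts are eligible, accuracy should degrade little while hot experts shed load, which matches the balance-with-accuracy framing above.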
arXiv Detail & Related papers (2025-09-29T16:29:17Z)
- Cost-Aware Contrastive Routing for LLMs [57.30288453580456]
We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space. CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%.
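As a toy reading of the shared embedding space (the `route_model` helper and the linear cost penalty `lam` are assumptions, not CSCR's objective), one can score each candidate model by its affinity to the prompt embedding minus a cost term and dispatch to the argmax:

```python
import numpy as np

def route_model(prompt_emb: np.ndarray, model_embs: np.ndarray,
                costs: list[float], lam: float = 0.1) -> int:
    """Pick the model index maximizing embedding affinity minus a cost
    penalty; model_embs holds one unit-normalized row per candidate model."""
    q = prompt_emb / np.linalg.norm(prompt_emb)
    affinity = model_embs @ q                  # higher = better expected quality
    return int(np.argmax(affinity - lam * np.asarray(costs)))
```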
arXiv Detail & Related papers (2025-08-17T20:16:44Z)
- Load Balancing Mixture of Experts with Similarity Preserving Routers [30.279616888339543]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks. We introduce a novel load balancing loss that preserves token-wise relational structure. Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
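The loss itself is not given in the summary; a minimal sketch of one way to "preserve token-wise relational structure", using a hypothetical `similarity_preserving_loss`, is to penalize disagreement between the token-token similarity matrix of the hidden states and that of the router's expert distributions:

```python
import torch
import torch.nn.functional as F

def similarity_preserving_loss(hidden: torch.Tensor,
                               router_probs: torch.Tensor) -> torch.Tensor:
    """hidden: (tokens, dim); router_probs: (tokens, num_experts).
    Tokens that look alike should be routed alike, and vice versa."""
    h = F.normalize(hidden, dim=-1)
    p = F.normalize(router_probs, dim=-1)
    return ((h @ h.T) - (p @ p.T)).pow(2).mean()
```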
arXiv Detail & Related papers (2025-06-16T22:22:59Z)
- Towards Adversarial Robustness of Model-Level Mixture-of-Experts Architectures for Semantic Segmentation [11.311414617703308]
We evaluate the adversarial vulnerability of MoEs for semantic segmentation of urban and highway traffic scenes. We show that MoEs are, in most cases, more robust to per-instance and universal white-box adversarial attacks and can better withstand transfer attacks.
arXiv Detail & Related papers (2024-12-16T09:49:59Z)
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws. We propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify less important tokens. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
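A minimal sketch of the routing idea as described, with an assumed `TokenPruner` module and `keep_ratio` parameter (the paper's actual router and training objective are not shown here): a learned linear scorer ranks tokens and only the top fraction continues to the next block.

```python
import torch

class TokenPruner(torch.nn.Module):
    """Toy learnable token router: score tokens, keep the top keep_ratio
    fraction, and preserve their original order."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score = torch.nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        s = self.score(x).squeeze(-1)                      # importance per token
        k = max(1, int(self.keep_ratio * x.shape[0]))
        keep = s.topk(k).indices.sort().values             # restore sequence order
        return x[keep]
```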
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
- Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [78.77361169167149]
We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
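The mechanism as summarized is simple enough to sketch directly (the `gate_with_dropout` helper and the drop probability are illustrative, and apply during training only): with some probability a token skips the router's cross-device dispatch and is processed by the expert on its own machine, saving all-to-all communication.

```python
import random

def gate_with_dropout(router_choice: int, local_expert: int,
                      p_drop: float = 0.1) -> int:
    """With probability p_drop, ignore the gating network and keep the token
    on its local device's expert; otherwise follow the learned routing."""
    if random.random() < p_drop:
        return local_expert      # no cross-machine traffic for this token
    return router_choice
```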
arXiv Detail & Related papers (2022-05-28T05:12:43Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
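To make the collapse claim concrete, a simple diagnostic (an assumption for illustration, not the paper's analysis) is the share of total variance that lies within expert groups: as tokens cluster around expert centroids, hidden states assigned to the same expert grow mutually similar and this ratio falls.

```python
import numpy as np

def within_expert_variance_ratio(hidden: np.ndarray,
                                 assignments: np.ndarray) -> float:
    """hidden: (tokens, dim); assignments: (tokens,) expert ids.
    Returns within-group sum of squares over total sum of squares, in [0, 1];
    values sinking toward 0 indicate clustering around expert centroids."""
    total_ss = hidden.var(axis=0).sum() * len(hidden)
    within_ss = 0.0
    for e in np.unique(assignments):
        group = hidden[assignments == e]
        within_ss += group.var(axis=0).sum() * len(group)
    return float(within_ss / total_ss)
```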
arXiv Detail & Related papers (2022-04-20T01:40:19Z)