Route Experts by Sequence, not by Token
- URL: http://arxiv.org/abs/2511.06494v1
- Date: Sun, 09 Nov 2025 18:36:07 GMT
- Title: Route Experts by Sequence, not by Token
- Authors: Tiansheng Wen, Yifei Wang, Aosong Feng, Long Ma, Xinyang Liu, Yifan Wang, Lixuan Guo, Bo Chen, Stefanie Jegelka, Chenyu You
- Abstract summary: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token. The standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level.
- Score: 58.92918003265283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.
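The abstract notes that SeqTopK needs only a few lines of code. Below is a minimal PyTorch sketch of the sequence-level selection it describes; the function name, the per-token softmax, and the final renormalization are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of sequence-level TopK routing, assuming a [T, E] matrix of
# router logits for T tokens and E experts. Illustrative only.
import torch
import torch.nn.functional as F

def seqtopk_route(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Select the top T*K (token, expert) pairs over a whole sequence.

    Returns a [T, E] gate matrix with unselected pairs zeroed out.
    """
    T, E = router_logits.shape
    scores = F.softmax(router_logits, dim=-1)        # per-token expert probabilities
    top_idx = scores.flatten().topk(T * k).indices   # one TopK over the pooled T*E scores
    mask = torch.zeros(T * E, device=scores.device)
    mask[top_idx] = 1.0
    gates = scores * mask.view(T, E)                 # hard tokens may keep more than k experts
    # One plausible normalization (an assumption): renormalize each token's surviving gates.
    return gates / gates.sum(-1, keepdim=True).clamp_min(1e-9)

# Example: 4 tokens, 8 experts, an average budget of k = 2 experts per token.
gates = seqtopk_route(torch.randn(4, 8), k=2)
print((gates > 0).sum(-1))  # per-token expert counts; they sum to 4 * 2 = 8
```

The only change relative to standard TopK routing is that the single `topk` call runs over the pooled T*E scores instead of once per token, so the fixed T*K budget can concentrate on difficult tokens.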
Related papers
- SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - Distribution-Aware Feature Selection for SAEs [1.2396474483677118]
TopK SAE reconstructs each token from its K most active latents. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery".
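The contrast between the two selection rules is easy to sketch. The PyTorch fragment below assumes a [B, D] matrix of SAE latent activations; names and shapes are assumptions for illustration, not the paper's code.

```python
# Per-token TopK vs. BatchTopK selection in a sparse autoencoder (sketch).
import torch

def topk_per_token(latents: torch.Tensor, k: int) -> torch.Tensor:
    # latents: [B, D]; keep each token's K most active latents independently.
    vals, idx = latents.topk(k, dim=-1)
    return torch.zeros_like(latents).scatter(-1, idx, vals)

def batch_topk(latents: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the B*K largest activations pooled over the whole batch, so tokens
    # with few active features can cede capacity to tokens with many.
    B, D = latents.shape
    flat = latents.flatten()
    vals, idx = flat.topk(B * k)
    out = torch.zeros_like(flat)
    out[idx] = vals
    return out.view(B, D)

# Toy usage: same total sparsity budget, different per-token allocation.
x = torch.randn(32, 256).relu()
print((batch_topk(x, k=8) > 0).sum())  # 32 * 8 nonzeros across the batch
```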
arXiv Detail & Related papers (2025-08-29T04:42:17Z) - Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization [51.562474873972086]
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. We propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG.
arXiv Detail & Related papers (2025-04-29T11:06:03Z) - ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences. ZETA matches the performance of standard attention on the synthetic Multi-Query Associative Recall task.
arXiv Detail & Related papers (2025-01-24T15:33:05Z) - Entanglement-assisted Quantum Error Correcting Code Saturating The Classical Singleton Bound [44.154181086513574]
We introduce a construction for entanglement-assisted quantum error-correcting codes (EAQECCs) that saturates the classical Singleton bound with less shared entanglement than any known method for code rates below $\frac{k}{n} = \frac{1}{3}$.
We demonstrate that any classical $[n,k,d]_q$ code can be transformed into an EAQECC with parameters $[[n,k,d;2k]]_q$ using $2k$ pre-shared maximally entangled pairs.
arXiv Detail & Related papers (2024-10-05T11:56:15Z) - Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z) - Mixture-of-Experts with Expert Choice Routing [44.777850078713634]
Prior work allocates a fixed number of experts to each token using a top-k function.
We propose a heterogeneous mixture-of-experts employing an expert choice method.
Our method improves training convergence time by more than 2x.
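As a rough sketch of the expert-choice idea (experts select tokens, rather than tokens selecting experts), the PyTorch fragment below may help; the capacity formula and names are assumptions based on this snippet, not the paper's implementation.

```python
# Expert-choice routing (sketch): each expert picks its top-C tokens, so
# expert load is balanced by construction rather than by an auxiliary loss.
import torch
import torch.nn.functional as F

def expert_choice_route(router_logits: torch.Tensor, capacity_factor: float = 1.0):
    # router_logits: [T, E] token-to-expert affinities.
    T, E = router_logits.shape
    capacity = max(1, int(capacity_factor * T / E))      # tokens each expert takes (assumed)
    scores = F.softmax(router_logits, dim=-1)
    vals, token_idx = scores.topk(capacity, dim=0)       # per expert, its top-C tokens
    gates = torch.zeros_like(scores)
    gates.scatter_(0, token_idx, vals)
    return gates  # [T, E]; a token may be picked by zero, one, or many experts

gates = expert_choice_route(torch.randn(16, 4))
print((gates > 0).sum(0))  # each of the 4 experts holds exactly 4 tokens
```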
arXiv Detail & Related papers (2022-02-18T17:46:11Z) - Accelerating BERT Inference for Sequence Labeling via Early-Exit [65.7292767360083]
We extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks.
We also propose a token-level early-exit mechanism that allows partial tokens to exit early at different layers.
Our approach can save up to 66%-75% inference cost with minimal performance degradation.
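One plausible way to realize token-level early exit is a per-layer confidence threshold, sketched below; the exit criterion and the stand-in modules are assumptions for illustration, since the snippet does not specify how tokens decide to exit.

```python
# Token-level early exit for sequence labeling (hedged sketch): tokens whose
# internal classifier is confident stop at shallow layers; the rest continue.
import torch

@torch.no_grad()
def early_exit_tag(hidden, layers, classifiers, threshold=0.95):
    # hidden: [T, H]; layers / classifiers: one pair per transformer layer.
    T = hidden.size(0)
    labels = torch.full((T,), -1, dtype=torch.long)
    active = torch.ones(T, dtype=torch.bool)           # tokens still computing
    for layer, clf in zip(layers, classifiers):
        hidden = layer(hidden)                         # in practice, only active tokens
        probs = clf(hidden).softmax(-1)
        conf, pred = probs.max(-1)
        done = active & (conf >= threshold)            # confident tokens exit here
        labels[done] = pred[done]
        active &= ~done
        if not active.any():
            break                                      # every token exited early
    labels[active] = pred[active]                      # leftovers use the last layer
    return labels

# Toy usage with stand-in modules: 3 layers, hidden size 16, 5 tags.
layers = [torch.nn.Linear(16, 16) for _ in range(3)]
classifiers = [torch.nn.Linear(16, 5) for _ in range(3)]
print(early_exit_tag(torch.randn(10, 16), layers, classifiers))
```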
arXiv Detail & Related papers (2021-05-28T14:39:26Z) - BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
arXiv Detail & Related papers (2021-03-30T23:08:32Z)
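The linear-assignment formulation in this snippet can be sketched directly. The fragment below uses SciPy's Hungarian solver as a stand-in for the paper's more scalable solver, with the column-tiling trick as an illustrative assumption.

```python
# BASE-style balanced assignment (sketch): route tokens to experts by solving
# a linear assignment problem so every expert receives exactly T/E tokens.
import numpy as np
from scipy.optimize import linear_sum_assignment

def base_assign(scores: np.ndarray) -> np.ndarray:
    """scores: [T, E] token-expert affinities, with T divisible by E.
    Returns one expert id per token; each expert gets exactly T/E tokens."""
    T, E = scores.shape
    cap = T // E
    # Duplicate each expert column `cap` times so the problem becomes a square
    # T x T assignment, then maximize total affinity.
    tiled = np.repeat(scores, cap, axis=1)             # [T, E*cap] == [T, T]
    row, col = linear_sum_assignment(tiled, maximize=True)
    return col // cap                                  # map duplicated slots back to experts

# Example: 8 tokens, 4 experts -> each expert receives exactly 2 tokens.
rng = np.random.default_rng(0)
assign = base_assign(rng.standard_normal((8, 4)))
print(np.bincount(assign, minlength=4))  # [2 2 2 2]
```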