Improving Routing in Sparse Mixture of Experts with Graph of Tokens
- URL: http://arxiv.org/abs/2505.00792v1
- Date: Thu, 01 May 2025 18:44:20 GMT
- Title: Improving Routing in Sparse Mixture of Experts with Graph of Tokens
- Authors: Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk
- Abstract summary: We unveil the limitation of Sparse Mixture of Experts (SMoE) through the perspective of the probabilistic graphical model (PGM). We propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations.
- Score: 32.46693871593765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Mixture of Experts (SMoE) has emerged as a key to achieving unprecedented scalability in deep learning. By activating only a small subset of parameters per sample, SMoE achieves an exponential increase in parameter counts while maintaining a constant computational overhead. However, SMoE models are susceptible to routing fluctuations--changes in the routing of a given input to its target expert--at the late stage of model training, leading to model non-robustness. In this work, we unveil the limitation of SMoE through the perspective of the probabilistic graphical model (PGM). Through this PGM framework, we highlight the independence in the expert-selection of tokens, which exposes the model to routing fluctuation and non-robustness. Alleviating this independence, we propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We then derive a new PGM underlying an (S)MoE-Attention block, going beyond just a single (S)MoE layer. Leveraging the token similarities captured by the attention matrix, we propose the innovative Attention-Aware (S)MoE, which employs the attention matrix to guide the routing of tokens to appropriate experts in (S)MoE. We theoretically prove that Similarity/Attention-Aware routing helps reduce the entropy of expert selection, resulting in more stable token routing mechanisms. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations, enhancing accuracy, and increasing model robustness over the baseline MoE-Transformer with token routing via softmax gating.
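The following is a minimal PyTorch-style sketch of the attention-aware routing idea described in the abstract, written only from that description; it is not the authors' released implementation. The function name `attention_aware_router`, the convex `mixing_weight` rule, and the toy sizes are illustrative assumptions. The paper's entropy-reduction result concerns trained routers, so the entropy comparison on random tensors below need not show a drop.

```python
# Hypothetical sketch of attention-aware routing for an SMoE layer.
# NOT the authors' implementation; the mixing rule is assumed for illustration.
import torch
import torch.nn.functional as F

def attention_aware_router(x, attn, w_gate, top_k=2, mixing_weight=0.5):
    """
    x:      (batch, seq_len, d_model)   token representations
    attn:   (batch, seq_len, seq_len)   row-stochastic attention matrix from the
                                        preceding attention block
    w_gate: (d_model, n_experts)        gating weights of the SMoE router
    Returns top-k expert indices, renormalized routing weights, and full probs.
    """
    logits = x @ w_gate                               # (B, S, E) per-token router logits
    # Smooth each token's logits with those of the tokens it attends to,
    # so similar tokens tend to select the same experts.
    smoothed = (1 - mixing_weight) * logits + mixing_weight * torch.bmm(attn, logits)
    probs = F.softmax(smoothed, dim=-1)               # softmax gating
    top_p, top_idx = probs.topk(top_k, dim=-1)        # sparse top-k selection
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize kept weights
    return top_idx, top_p, probs

# Toy usage: compare routing entropies with and without attention smoothing
# (random tensors, so the direction of the change is not meaningful here).
B, S, D, E = 2, 8, 16, 4
x = torch.randn(B, S, D)
attn = F.softmax(torch.randn(B, S, S), dim=-1)
w_gate = torch.randn(D, E)

_, _, p_smooth = attention_aware_router(x, attn, w_gate)
p_plain = F.softmax(x @ w_gate, dim=-1)

def entropy(p):
    return -(p * p.clamp_min(1e-9).log()).sum(-1).mean().item()

print("plain routing entropy:   ", round(entropy(p_plain), 3))
print("smoothed routing entropy:", round(entropy(p_smooth), 3))
```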
Related papers
- RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. For attribution and tampering detection, we introduce a similarity-based matching algorithm.
arXiv Detail & Related papers (2025-08-03T14:51:58Z)
- Attention, Please! Revisiting Attentive Probing for Masked Image Modeling [20.39513629593113]
We introduce efficient probing (EP), a cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. EP generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings.
arXiv Detail & Related papers (2025-06-11T21:10:26Z)
- DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism [5.988126768890861]
DynMoLE is a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. Our experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements.
arXiv Detail & Related papers (2025-04-01T11:14:19Z)
- Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts [33.39800923804871]
We introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens.
arXiv Detail & Related papers (2025-03-20T11:45:08Z)
- DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models [1.4255659581428335]
We propose DA-MoE, a novel approach that dynamically allocates a variable number of experts in Mixture-of-Experts models based on an effective token importance measure.
Our approach consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.
arXiv Detail & Related papers (2024-09-10T17:36:15Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [104.60508550106618]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs). We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Understanding Self-attention Mechanism via Dynamical System Perspective
Self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) found in high-precision solutions of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NNs).
We show that SAM is also a stiffness-aware step-size adaptor that can enhance the model's representational ability by measuring the intrinsic SP (a generic sketch of this ODE view of attention follows below).
arXiv Detail & Related papers (2023-08-19T08:17:41Z)
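For readers unfamiliar with the dynamical-system perspective referenced in the last entry, the sketch below shows only the generic reading of a residual self-attention block as one explicit Euler step of a token-interaction ODE; it is not the paper's stiffness-aware adaptor, and the single-head setup and step size `h` are assumptions made for illustration.

```python
# Illustrative sketch only: a residual self-attention update viewed as an
# explicit Euler step x <- x + h * f(x), where f is a single-head attention map.
import torch
import torch.nn.functional as F

def attention_vector_field(x, w_q, w_k, w_v):
    """f(x): single-head self-attention as the drift of a token-dynamics ODE."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def euler_attention_step(x, weights, h=1.0):
    """One residual attention block = one explicit Euler step of size h."""
    return x + h * attention_vector_field(x, *weights)

d = 16
weights = tuple(torch.randn(d, d) / d ** 0.5 for _ in range(3))
x = torch.randn(4, d)              # 4 tokens evolving under the ODE
for _ in range(3):                 # stacking blocks = integrating in depth
    x = euler_attention_step(x, weights, h=0.5)
print(x.shape)
```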