DirMoE: Dirichlet-routed Mixture of Experts
- URL: http://arxiv.org/abs/2602.09001v1
- Date: Mon, 09 Feb 2026 18:45:43 GMT
- Title: DirMoE: Dirichlet-routed Mixture of Experts
- Authors: Amirhossein Vahidi, Hesam Asadollahzadeh, Navid Akhavan Attar, Marie Moullet, Kevin Ly, Xingyi Yang, Mohammad Lotfollahi
- Abstract summary: Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We introduce Dirichlet-Routed MoE, a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.
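The two-part routing described in the abstract can be illustrated with a minimal, stdlib-only Python sketch: a Gumbel-Sigmoid relaxation gates each expert (the relaxed Bernoulli selection), and Dirichlet weights, drawn here as normalized Gamma samples, set the contribution mix among experts. All function names and numbers below are illustrative assumptions, not from the paper, and the sketch omits the learned encoder networks, the implicit-reparameterization gradients, and the ELBO objective.

```python
import math
import random

def gumbel_sigmoid(logit, tau):
    # Relaxed Bernoulli gate: add Logistic(0, 1) noise to the logit,
    # then squash with a temperature-scaled sigmoid. Low tau -> near-binary.
    u = random.random()
    g = math.log(u) - math.log(1.0 - u)  # logistic noise sample
    return 1.0 / (1.0 + math.exp(-(logit + g) / tau))

def dirichlet_sample(alphas):
    # Dirichlet draw via normalized Gamma samples (standard construction).
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def route(select_logits, concentration, tau=0.5):
    # Disentangled routing: selection gates (which experts are active)
    # multiplied by Dirichlet weights (how contribution is shared),
    # renormalized so the mixture weights sum to one.
    gates = [gumbel_sigmoid(l, tau) for l in select_logits]
    weights = dirichlet_sample(concentration)
    combined = [g * w for g, w in zip(gates, weights)]
    z = sum(combined) or 1.0
    return [c / z for c in combined]

random.seed(0)
mix = route([2.0, -1.0, 0.5, -3.0], [1.5, 0.8, 1.2, 0.6])
print(mix)
```

In the paper's formulation the gates and concentrations are produced by learned networks and trained end-to-end; here they are fixed toy values, chosen only to show how a strongly positive selection logit tends to keep its expert's Dirichlet share while negative logits suppress theirs.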
Related papers
- SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundancy and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z) - Token-Level LLM Collaboration via FusionRoute [60.72307345997823]
FusionRoute is a token-level multi-LLM collaboration framework. It selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution. It outperforms sequence- and token-level collaboration baselines, model merging, and direct fine-tuning.
arXiv Detail & Related papers (2026-01-08T16:53:16Z) - The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models [18.428606280260187]
Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. We introduce COMMITTEEAUDIT, a framework that analyzes routing behavior at the level of expert groups rather than individual experts. We find that Standing Committees consistently capture the majority of routing mass across domains, layers, and routing budgets.
arXiv Detail & Related papers (2026-01-06T21:29:45Z) - ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score". We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks. A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
arXiv Detail & Related papers (2025-11-14T05:31:37Z) - Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z) - LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts [24.0422448103907]
We propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts. Our design allows the model to adaptively determine the number of experts to activate for each token at different layers. Our method not only achieves superior performance but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
arXiv Detail & Related papers (2025-09-30T02:38:10Z) - EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models [25.12002287083368]
Multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. Expert uniformity occurs because MoE experts are often created by simply replicating the FFN parameters from LLMs. Router rigidity stems from the prevalent use of static linear routers for expert selection.
arXiv Detail & Related papers (2025-05-28T08:38:39Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - Autonomy-of-Experts Models [34.82103329222486]
We propose a novel MoE paradigm, Autonomy-of-Experts (AoE), in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token. Only the top-ranking experts proceed with the forward pass, while the others abort.
arXiv Detail & Related papers (2025-01-22T18:37:08Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.