Unified Sparse Mixture of Experts
- URL: http://arxiv.org/abs/2503.22996v2
- Date: Mon, 27 Oct 2025 04:51:10 GMT
- Title: Unified Sparse Mixture of Experts
- Authors: Giang Do, Hung Le, Truyen Tran,
- Abstract summary: Sparse Mixture of Experts (SMoEs) models scale the capacity of models while maintaining constant computational overhead.<n>This paper proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations.
- Score: 14.774596844618396
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Sparse Mixture of Experts (SMoEs) models scale the capacity of models while maintaining constant computational overhead. Early designs typically relied on a fixed value of $k$, where $k$ represents either the number of experts selected per token or the number of tokens assigned per expert. However, these approaches encounter three key limitations: they may fail to route to important experts or tokens, may assign irrelevant ones, and often suffer from representation collapse among experts. This paper reexamines SMoEs through the lens of \textit{Linear Programming}, and proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations. Specifically, our approach introduces a unified mechanism that integrates information from both the expert and token dimensions, and a unified scoring function that linearly combines similarity scores between experts and tokens. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness in overcoming the limitations of traditional routing methods. Through comprehensive evaluations on both clean and corrupted settings for large language models and vision tasks, under both training-free and training scenarios, USMoE achieves up to a 10\% performance improvement over standard approaches or reduces inference costs by up to 14\%, while maintaining competitive accuracy.
Related papers
- $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts [43.075289015406355]
Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance.<n>We propose $infty$-MoE that selects a portion of the parameters of large FFNs based on continuous values sampled for each token.<n>Experiments show that a GPT-2 Small-based $infty$-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters.
arXiv Detail & Related papers (2026-01-25T03:55:51Z) - Advancing Expert Specialization for Better MoE [22.88847592702946]
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input.<n>We observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing.<n>We propose a simple yet effective solution that introduces two complementary objectives.
arXiv Detail & Related papers (2025-05-28T13:09:47Z) - Enhancing CTR Prediction with De-correlated Expert Networks [53.05653547330796]
We propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations.<n>Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle.
arXiv Detail & Related papers (2025-05-23T14:04:38Z) - CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition [33.34992335920672]
We argue that effective SMoE training remains challenging because of the suboptimal routing process.<n>In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response.<n>We develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy.
arXiv Detail & Related papers (2025-05-19T17:24:26Z) - Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts.
We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
arXiv Detail & Related papers (2025-04-16T04:06:15Z) - Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
This study investigates domain specialization and expert redundancy in large-scale MoE models.<n>We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts.<n>Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method can achieve comparable performances and $2.99times$ throughput under the same memory budget with full model with only half the experts.
arXiv Detail & Related papers (2025-04-09T11:34:06Z) - Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations [86.90549830760513]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks.
We propose MoE Experts Compression Suite (MC-Suite) to provide a benchmark for estimating expert importance from diverse perspectives.
We present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt.
arXiv Detail & Related papers (2025-04-08T00:49:08Z) - S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning [34.20340688374905]
Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts.<n>Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations.<n>We propose a novel approach called Sparse Mixture of Experts via Robust Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and nondeterministic inputs.
arXiv Detail & Related papers (2025-03-29T08:14:27Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models.<n>Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights.<n>We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE)<n>VQMoE is an effective solution for scaling up model capacity without increasing the computational costs.<n>We show that VQMoE achieves a 28% improvement in routers compared to other SMoE routing methods.
arXiv Detail & Related papers (2024-11-28T22:32:01Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network(FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE.
We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z) - Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models [57.582219834039506]
We introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts.
It is based on the pre-existing dense checkpoints of our Skywork-13B model.
arXiv Detail & Related papers (2024-06-03T03:58:41Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts)
Sparse MoE selectively engages only a limited number, or even just one expert, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - CompeteSMoE -- Effective Training of Sparse Mixture of Experts via
Competition [52.2034494666179]
Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the mean of increasing the network's depth or width.
We propose a competition mechanism to address this fundamental challenge of representation collapse.
By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
arXiv Detail & Related papers (2024-02-04T15:17:09Z) - Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.