How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts
- URL: http://arxiv.org/abs/2512.19765v1
- Date: Sun, 21 Dec 2025 05:37:42 GMT
- Title: How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts
- Authors: Sumin Park, Noseong Park
- Abstract summary: We propose a semantic-aware MoE framework for adaptive expert expansion and dynamic routing. MASS converges to the optimal balance point of the cost-performance trade-off with notably improved semantic specialization.
- Score: 30.125087273625123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding the optimal configuration of Sparse Mixture-of-Experts (SMoE) that maximizes semantic differentiation among experts is essential for exploiting the full potential of MoE architectures. However, existing SMoE frameworks either heavily rely on hyperparameter tuning or overlook the importance of diversifying semantic roles across experts when adapting the expert pool size. We propose Mixture-of-Experts for Adaptive Semantic Specialization (MASS), a semantic-aware MoE framework for adaptive expert expansion and dynamic routing. MASS introduces two key advancements: (i) a gradient-based semantic drift detector that prompts targeted expert expansion when the existing expert pool lacks capacity to capture the full semantic diversity of the data, and (ii) an adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence mass. We first demonstrate that MASS reliably converges to the optimal balance point of the cost-performance trade-off, with notably improved semantic specialization, in a highly controlled synthetic setup. Further empirical results on real-world datasets across language and vision domains show that MASS consistently outperforms a range of strong MoE baselines, demonstrating its domain robustness and enhanced expert specialization.
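The abstract describes the adaptive routing only at a high level. One plausible reading of "confidence mass" routing — activating experts per token until the cumulative softmax routing probability reaches a threshold — can be sketched as follows. The function name, threshold value, and renormalization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def confidence_mass_routing(logits, mass_threshold=0.9, max_experts=None):
    """Select experts per token until cumulative routing probability
    (confidence mass) reaches `mass_threshold`.

    logits: (num_tokens, num_experts) raw router scores.
    Returns a list of (expert_indices, weights) per token.
    Hypothetical sketch -- not the MASS paper's implementation.
    """
    # Softmax over experts for each token (numerically stabilized)
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    assignments = []
    for p in probs:
        order = np.argsort(-p)                 # experts by descending confidence
        cum = np.cumsum(p[order])
        # Smallest k whose cumulative mass covers the threshold
        k = int(np.searchsorted(cum, mass_threshold)) + 1
        if max_experts is not None:
            k = min(k, max_experts)
        idx = order[:k]
        w = p[idx] / p[idx].sum()              # renormalize selected weights
        assignments.append((idx, w))
    return assignments
```

Under this reading, tokens the router is confident about consume few experts, while ambiguous tokens spread their budget over more — which is one way a router could "dynamically adjust expert usage" per token.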
Related papers
- pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation [68.3777121585281]
We propose a novel Mixture-of-Experts prompt tuning method called pMoE. The proposed pMoE significantly enhances the model's versatility and applicability across a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains.
arXiv Detail & Related papers (2026-02-26T12:27:06Z)
- SD-MoE: Spectral Decomposition for Effective Expert Specialization [29.649486549025138]
Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. Some experts become functionally similar, while others function as de facto shared experts, limiting the effective capacity and model performance. We propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space.
arXiv Detail & Related papers (2026-02-13T03:07:26Z)
- AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert [26.761443359046286]
We propose AnyExperts, a novel on-demand, budget-aware dynamic routing framework. It allocates a variable total number of expert slots per token based on its semantic importance. It is evaluated across diverse tasks in visual understanding, audio understanding, and natural language understanding.
arXiv Detail & Related papers (2025-11-23T06:53:43Z)
- Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
- ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond existing layer-local approaches. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z)
- Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning [30.804111793049938]
We propose Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage routes to expert groups based on sequence-level features, while the second stage activates the top-$k$ experts within the selected group at the token level. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities.
arXiv Detail & Related papers (2025-09-03T07:17:35Z)
- MoE-MLoRA for Multi-Domain CTR Prediction: Efficient Adaptation with Expert Specialization [0.0]
MoE-MLoRA is a mixture-of-experts framework in which each expert is first trained independently to specialize in its domain. We evaluate MoE-MLoRA across eight CTR models on MovieLens and Taobao.
arXiv Detail & Related papers (2025-06-09T09:03:05Z)
- On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating [75.29576838162714]
DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and a normalized sigmoid gating mechanism. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency from both the shared expert strategy and the normalized sigmoid gating.
arXiv Detail & Related papers (2025-05-16T04:58:18Z)
- Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework for advancing the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with standard softmax gating or its variants, including dense-to-sparse gating and hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
- Flexible and Adaptable Summarization via Expertise Separation [59.26639426529827]
A proficient summarization model should exhibit both flexibility and adaptability.
We propose MoeSumm, a Mixture-of-Expert Summarization architecture.
Our model's distinct separation of general and domain-specific summarization abilities grants it notable flexibility and adaptability.
arXiv Detail & Related papers (2024-06-08T05:31:19Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and uses a gated routing network to conditionally activate experts.
However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
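Several entries above (MoEC, the softmax-gating convergence analysis) revolve around the same standard mechanism: a learned router produces softmax scores over experts, and each token is dispatched to its top-$k$ experts with renormalized gate weights. A minimal generic sketch of that mechanism, with all variable names illustrative and not tied to any one paper's code:

```python
import numpy as np

def topk_softmax_gating(x, w_router, k=2):
    """Standard sparse MoE gating: softmax over router logits,
    keep the top-k experts per token, renormalize their weights.

    x: (num_tokens, d_model) token representations.
    w_router: (d_model, num_experts) router weight matrix.
    Generic illustrative sketch of the common mechanism.
    """
    logits = x @ w_router                                # (tokens, experts)
    z = logits - logits.max(axis=-1, keepdims=True)      # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    topk_idx = np.argsort(-probs, axis=-1)[:, :k]        # top-k experts per token
    topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
    gates = topk_p / topk_p.sum(axis=-1, keepdims=True)  # renormalize over selected
    return topk_idx, gates
```

Each token's output would then be the gate-weighted sum of its selected experts' outputs; the variants surveyed above differ mainly in how this router is structured (hierarchical, dense-to-sparse, clustered, or adaptive in $k$).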
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.