Advancing Expert Specialization for Better MoE
- URL: http://arxiv.org/abs/2505.22323v3
- Date: Mon, 03 Nov 2025 07:21:22 GMT
- Title: Advancing Expert Specialization for Better MoE
- Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang,
- Abstract summary: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input.<n>We observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing.<n>We propose a simple yet effective solution that introduces two complementary objectives.
- Score: 22.88847592702946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
Related papers
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently.<n>SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized.<n>We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
arXiv Detail & Related papers (2026-02-23T15:11:16Z) - Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization [10.669680236190432]
We propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency.<n>We implement both losses as a drop-in Megatron-LM module.
arXiv Detail & Related papers (2026-02-15T14:19:12Z) - EMoE: Eigenbasis-Guided Routing for Mixture-of-Experts [13.726411744252509]
Eigen-Mixture-of-Experts (EMoE) is a novel architecture that leverages a routing mechanism based on a learned orthonormal specialization eigenbasis.<n>EMoE projects input tokens onto this shared eigenbasis and routes them based on their alignment with the principal components of the feature space.<n>This principled, geometric partitioning of data intrinsically promotes both balanced expert utilization and the development of diverse, specialized experts.
arXiv Detail & Related papers (2026-01-17T18:49:25Z) - MoE Pathfinder: Trajectory-driven Expert Pruning [19.790092938955336]
We propose an expert pruning approach based on the trajectory of activated experts across layers.<n>Our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.
arXiv Detail & Related papers (2025-12-20T17:05:08Z) - Exploiting the Experts: Unauthorized Compression in MoE-LLMs [1.580774794371876]
We study the prunability of MoE-LLMs under task-specific usage.<n>We propose defense strategies that aim to make MoE models harder to compress and fine-tune without authorization.
arXiv Detail & Related papers (2025-11-22T20:08:29Z) - ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score"<n>We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks.<n>A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
arXiv Detail & Related papers (2025-11-14T05:31:37Z) - Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models.<n>SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs.<n>Recent Mixture of Experts (MoE) approaches attempt to address this by SAEs into narrower expert networks with gated activation.<n>We propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z) - Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning [49.90176890917986]
Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL)<n>Existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing.<n>We propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts.
arXiv Detail & Related papers (2025-10-01T06:49:19Z) - EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization [46.40666108181214]
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning.<n>MoE models have inherent complexities that challenge conventional quantization techniques.<n>We propose EAQuant, a novel PTQ framework tailored for MoE architectures.
arXiv Detail & Related papers (2025-06-16T10:18:50Z) - Enhancing CTR Prediction with De-correlated Expert Networks [53.05653547330796]
We propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations.<n>Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle.
arXiv Detail & Related papers (2025-05-23T14:04:38Z) - Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts.<n>We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
arXiv Detail & Related papers (2025-04-16T04:06:15Z) - Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations [86.90549830760513]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks.<n>We propose MoE Experts Compression Suite (MC-Suite) to provide a benchmark for estimating expert importance from diverse perspectives.<n>We present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt.
arXiv Detail & Related papers (2025-04-08T00:49:08Z) - On the effectiveness of discrete representations in sparse mixture of experts [33.809432499123275]
We propose a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE)<n>VQMoE is an effective solution for scaling up model capacity without increasing the computational costs.<n>We show that VQMoE achieves a 28% improvement in routers compared to other SMoE routing methods.
arXiv Detail & Related papers (2024-11-28T22:32:01Z) - Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs)<n>We propose a novel expert routing framework that incorporates: (1) An efficient routing mechanism with lightweight computation; (2) An adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) A module that determines the lower bounds of expert capacity based on dynamic token distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z) - SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z) - Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z) - Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR)
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
arXiv Detail & Related papers (2023-06-06T15:04:31Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.