Related papers: Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

URL: http://arxiv.org/abs/2602.14159v1
Date: Sun, 15 Feb 2026 14:19:12 GMT
Title: Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
Authors: Rizhen Hu, Yuan Cao, Boao Kong, Mou Sun, Kun Yuan,
Abstract summary: We propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency.<n>We implement both losses as a drop-in Megatron-LM module.
Score: 10.669680236190432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap -- redundant representations across experts and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.

Related papers

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT)<n>SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions.<n>It also introduces adaptive expert activation to freeze selected experts during training, reducing redundant and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z)
Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures [2.538209532048867]
Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse.<n>We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity.
arXiv Detail & Related papers (2026-01-07T12:59:37Z)
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score"<n>We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks.<n>A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
arXiv Detail & Related papers (2025-11-14T05:31:37Z)
Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems [59.94955550958074]
We study a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network.<n>We show that expert specialization reduces gradient conflicts and makes each subtask strongly convex.<n>We prove that the training drives the expected prediction loss to near zero in $O(log(epsilon-1)$ steps, significantly improving over the $O(epsilon-1)$ rate for a single transformer.
arXiv Detail & Related papers (2025-10-30T21:07:36Z)
ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches.<n>ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z)
Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs [54.95810313530111]
DERN is a task-agnostic and retraining-free framework for expert pruning and reconstruction.<n>It improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity.
arXiv Detail & Related papers (2025-09-12T16:09:39Z)
Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers [10.912224105652044]
Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging.<n>We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness.<n>We find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness.
arXiv Detail & Related papers (2025-09-05T13:25:33Z)
RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models.<n>Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs.<n>For attribution and tampering detection, we introduce a similarity-based matching algorithm.
arXiv Detail & Related papers (2025-08-03T14:51:58Z)
Advancing Expert Specialization for Better MoE [22.88847592702946]
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input.<n>We observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing.<n>We propose a simple yet effective solution that introduces two complementary objectives.
arXiv Detail & Related papers (2025-05-28T13:09:47Z)
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations. A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. We propose the Multilinear Mixture of Experts ($mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks. We propose M-SMoE, which leverages routing statistics to guide expert merging. Our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance.
arXiv Detail & Related papers (2023-10-02T16:51:32Z)
MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.