Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
- URL: http://arxiv.org/abs/2601.03889v1
- Date: Wed, 07 Jan 2026 12:59:37 GMT
- Title: Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
- Authors: Ibrahim Delibasoglu
- Abstract summary: Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity.
- Score: 2.538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse, where routing converges to a few dominant experts. This reduces model capacity and causes catastrophic interference during adaptation. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity. Our method uses dual regularization: spectral norm constraints bound the Lipschitz constant of the routing function, while stable rank penalties preserve high-dimensional feature diversity in expert selection. We evaluate SR-MoE across architectural scales and dataset complexities using modular one-shot adaptation tasks. Results show that traditional linear gating fails with increasing depth (accuracy drops of up to 4.72% due to expert entanglement), while SR-MoE maintains structural integrity (mean interference of -0.32%). Our spectral constraints facilitate positive knowledge transfer, enabling localized expert updates without global performance decay. SR-MoE provides a general solution for building high-capacity, modular networks capable of stable lifelong learning.
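The abstract does not include reference code, but the dual regularizer it describes can be sketched for a simple linear router. The following is a minimal PyTorch sketch under stated assumptions: the spectral norm is computed via SVD, the stable rank is taken as ||W||_F^2 / sigma_max^2, and the class name, hinge-style penalty, and hyper-parameters (`target_sigma`, `lambda_spec`, `lambda_rank`) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrallyRegularizedRouter(nn.Module):
    """Linear MoE router with the two penalties described in the abstract:
    (1) a spectral-norm term that bounds the router's Lipschitz constant,
    (2) a stable-rank term that encourages high-dimensional routing features.
    The exact loss form and weights are assumptions, not the paper's."""

    def __init__(self, d_model: int, n_experts: int,
                 target_sigma: float = 1.0,
                 lambda_spec: float = 0.1, lambda_rank: float = 0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.target_sigma = target_sigma
        self.lambda_spec = lambda_spec
        self.lambda_rank = lambda_rank

    def regularization(self) -> torch.Tensor:
        W = self.gate.weight                             # (n_experts, d_model)
        sigma_max = torch.linalg.matrix_norm(W, ord=2)   # largest singular value
        stable_rank = (W ** 2).sum() / (sigma_max ** 2 + 1e-8)
        # Penalize spectral norm above the target (Lipschitz bound) and
        # reward a large stable rank (feature diversity in expert selection).
        spec_pen = F.relu(sigma_max - self.target_sigma) ** 2
        rank_pen = -stable_rank
        return self.lambda_spec * spec_pen + self.lambda_rank * rank_pen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.gate(x), dim=-1)           # routing probabilities
```

In training, `router.regularization()` would simply be added to the task loss; top-k selection and any load-balancing terms are omitted for brevity.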
Related papers
- Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization [10.669680236190432]
We propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency. We implement both losses as a drop-in Megatron-LM module.
arXiv Detail & Related papers (2026-02-15T14:19:12Z) - L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts [49.90176890917986]
We propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions. Experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
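From the summary alone, the two ingredients can be sketched: a shared low-rank latent routing space and a saturated inner-product score whose slope stays bounded. The projection rank and the tanh saturation below are assumptions, not the paper's definition of SIPS.

```python
import torch
import torch.nn as nn

class LowRankSaturatedRouter(nn.Module):
    """Sketch of L2R-style routing: tokens and experts meet in a shared
    low-rank latent space, and the inner-product score is saturated so the
    routing function's Lipschitz constant stays bounded (tanh is assumed)."""

    def __init__(self, d_model: int, n_experts: int, r: int = 16, s: float = 5.0):
        super().__init__()
        self.proj = nn.Linear(d_model, r, bias=False)             # low-rank routing space
        self.expert_emb = nn.Parameter(torch.randn(n_experts, r) / r ** 0.5)
        self.s = s                                                # saturation scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)                                          # (..., r)
        raw = z @ self.expert_emb.t()                             # inner-product scores
        scores = self.s * torch.tanh(raw / self.s)                # saturated, bounded slope
        return torch.softmax(scores, dim=-1)
```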
arXiv Detail & Related papers (2026-01-29T07:18:33Z) - Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
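The summary suggests a greedy "least-loaded" reassignment of tokens that overflow a capacity limit. A toy single-process sketch of that idea follows; the capacity rule, tie-breaking, and the omission of parameter migration across devices are all assumptions.

```python
from collections import Counter
from typing import List

def least_loaded_reassign(assignments: List[int], n_experts: int,
                          capacity: int) -> List[int]:
    """Toy sketch of least-loaded rerouting: tokens routed to an expert that is
    already at capacity are greedily moved to the currently least-loaded expert.
    Device placement and expert-parameter movement are omitted."""
    load = Counter()
    out = []
    for expert in assignments:
        if load[expert] >= capacity:
            # Reroute to whichever expert currently holds the fewest tokens.
            expert = min(range(n_experts), key=lambda e: load[e])
        load[expert] += 1
        out.append(expert)
    return out

# Example: expert 0 is oversubscribed, excess tokens spill to idle experts.
print(least_loaded_reassign([0, 0, 0, 0, 1], n_experts=4, capacity=2))
```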
arXiv Detail & Related papers (2026-01-23T18:19:15Z) - PhyG-MoE: A Physics-Guided Mixture-of-Experts Framework for Energy-Efficient GNSS Interference Recognition [49.955269674859004]
This paper introduces PhyG-MoE (Physics-Guided Mixture-of-Experts), a framework designed to align model capacity with signal complexity. Unlike static architectures, the proposed system employs a spectrum-based gating mechanism that routes signals based on their spectral feature entanglement. A high-capacity TransNeXt expert is activated on-demand to disentangle complex features in saturated scenarios, while lightweight experts handle fundamental signals to minimize latency.
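One plausible reading of "spectrum-based gating" is a cheap spectral statistic deciding when the heavy expert is needed. The sketch below uses spectral entropy of the FFT magnitude as a stand-in for "spectral feature entanglement"; that proxy, the threshold, and the hard gate are assumptions, not the paper's mechanism.

```python
import torch
import torch.nn as nn

class SpectrumGatedMoE(nn.Module):
    """Sketch of spectrum-based gating: a lightweight expert handles signals
    with a simple spectrum; a high-capacity expert is activated only when the
    normalized spectral entropy (an assumed proxy for spectral entanglement)
    exceeds a threshold."""

    def __init__(self, light: nn.Module, heavy: nn.Module, threshold: float = 0.8):
        super().__init__()
        self.light, self.heavy, self.threshold = light, heavy, threshold

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, samples) time-domain samples
        mag = torch.fft.rfft(signal, dim=-1).abs()
        p = mag / (mag.sum(dim=-1, keepdim=True) + 1e-8)
        entropy = -(p * (p + 1e-8).log()).sum(dim=-1)
        entropy = entropy / torch.log(torch.tensor(float(p.shape[-1])))  # normalize to [0, 1]
        use_heavy = entropy > self.threshold                             # per-sample hard gate
        out = self.light(signal)
        if use_heavy.any():
            out[use_heavy] = self.heavy(signal[use_heavy])
        return out
```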
arXiv Detail & Related papers (2026-01-19T07:57:52Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models. We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
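Treating each LoRA rank as an expert amounts to gating the rank-1 components of the low-rank update individually. A minimal sketch under that reading follows; the sigmoid router and the omission of degradation-aware conditioning are assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    """Sketch of a Mixture-of-Ranks adapter: the LoRA update B @ A is split into
    its rank-1 components and a router weights each rank (expert) per input.
    Degradation-aware routing from the paper is omitted."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.router = nn.Linear(base.in_features, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.router(x))     # (..., r), one gate per rank
        delta = (x @ self.A.t()) * gates           # gate each rank-1 expert
        return self.base(x) + delta @ self.B.t()
```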
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score". We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks. A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
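The summary does not define the "Eigenbasis Score"; one assumed reading, sketched below purely for illustration, scores each expert by how much of a token's norm is captured by projecting onto that expert's learned orthonormal basis.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class EigenbasisScoreGate(nn.Module):
    """Sketch of an eigenbasis-style gate (an assumed reading of ERMoE's
    'Eigenbasis Score'): each expert owns an orthonormal basis, and its score is
    the fraction of the token's norm that lies inside that subspace."""

    def __init__(self, d_model: int, n_experts: int, k: int = 16):
        super().__init__()
        self.bases = nn.ModuleList([
            orthogonal(nn.Linear(d_model, k, bias=False))   # rows span the expert's subspace
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = torch.stack(
            [basis(x).norm(dim=-1) / (x.norm(dim=-1) + 1e-8) for basis in self.bases],
            dim=-1,
        )                                                    # (..., n_experts), each in [0, 1]
        return scores / (scores.sum(dim=-1, keepdim=True) + 1e-8)
```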
arXiv Detail & Related papers (2025-11-14T05:31:37Z) - Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts [11.437368205968573]
This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts. We show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality.
arXiv Detail & Related papers (2025-10-08T16:40:31Z) - Towards Generalized Range-View LiDAR Segmentation in Adverse Weather [65.22588361803942]
We identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. We propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our approach significantly improves generalization to adverse weather with minimal inference overhead.
arXiv Detail & Related papers (2025-06-10T16:48:27Z) - Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate [18.379123927374042]
We propose ConfSMoE, which introduces a two-stage imputation module to handle the missing-modality problem in the SMoE architecture. Inspired by our theoretical analysis, ConfSMoE proposes a novel expert gating mechanism that detaches the softmax routing score and aligns it with a task confidence score with respect to the ground-truth signal.
arXiv Detail & Related papers (2025-05-26T05:18:55Z) - Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) for remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
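The factorization is not spelled out in the summary; a minimal sketch of one multilinear option, a CP-factorized expert weight tensor whose shared factors let all experts be evaluated densely at a cost that grows with the rank rather than the expert count, is given below under that assumption.

```python
import torch
import torch.nn as nn

class CPFactorizedMoE(nn.Module):
    """Sketch of a multilinear MoE layer with CP-factorized expert weights:
    W_e = sum_r C[e, r] * outer(V[:, r], U[:, r]).  All experts are applied
    implicitly through the shared factors U and V. Rank and gating are assumptions."""

    def __init__(self, d_in: int, d_out: int, n_experts: int, rank: int = 64):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_in, rank) * 0.02)       # shared input factor
        self.V = nn.Parameter(torch.randn(d_out, rank) * 0.02)      # shared output factor
        self.C = nn.Parameter(torch.randn(n_experts, rank) * 0.02)  # per-expert mixing
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)   # (..., E) soft expert weights
        h = x @ self.U                            # (..., R)
        mix = g @ self.C                          # (..., R): gate-weighted expert coefficients
        return (h * mix) @ self.V.t()             # equals sum_e g_e * (W_e @ x)
```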
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Efficient Deweather Mixture-of-Experts with Uncertainty-aware Feature-wise Linear Modulation [44.43376913419967]
We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in image restoration quality by 0.1-0.2 dB.
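A minimal sketch of the feature-wise linear modulation idea: a single shared expert block, with per-expert FiLM parameters (scale and shift) mixed by the gate. The block structure is an assumption, and the uncertainty-aware part of MoFME is omitted.

```python
import torch
import torch.nn as nn

class FiLMSharedExpertMoE(nn.Module):
    """Sketch of MoFME-style expert sharing: experts exist only as FiLM
    (scale, shift) modulations of one shared expert block, so parameters and
    compute barely grow with the number of experts. Uncertainty-aware gating
    from the paper is omitted."""

    def __init__(self, d: int, n_experts: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.gamma = nn.Parameter(torch.ones(n_experts, d))    # per-expert scales
        self.beta = nn.Parameter(torch.zeros(n_experts, d))    # per-expert shifts
        self.gate = nn.Linear(d, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)                # (..., E)
        gamma, beta = g @ self.gamma, g @ self.beta            # mix the modulations
        return gamma * self.shared(x) + beta                   # one shared forward pass
```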
arXiv Detail & Related papers (2023-12-27T15:23:37Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to activate experts conditionally.
However, as the number of experts grows, MoE models with enormous parameter counts suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)