Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
- URL: http://arxiv.org/abs/2601.03889v1
- Date: Wed, 07 Jan 2026 12:59:37 GMT
- Title: Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
- Authors: Ibrahim Delibasoglu
- Abstract summary: Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity.
- Score: 2.538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse, where routing converges to a few dominant experts. This reduces model capacity and causes catastrophic interference during adaptation. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity. Our method uses dual regularization: spectral norm constraints bound the Lipschitz constant of the routing function, while stable rank penalties preserve high-dimensional feature diversity in expert selection. We evaluate SR-MoE across architectural scales and dataset complexities using modular one-shot adaptation tasks. Results show that traditional linear gating fails with increasing depth (accuracy drops of up to 4.72% due to expert entanglement), while SR-MoE maintains structural integrity (mean interference of -0.32%). Our spectral constraints facilitate positive knowledge transfer, enabling localized expert updates without global performance decay. SR-MoE provides a general solution for building high-capacity, modular networks capable of stable lifelong learning.
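The abstract does not include reference code, but the dual regularizer it describes can be sketched for a simple linear router. The following is a minimal PyTorch sketch under stated assumptions: the spectral norm is computed via SVD, the stable rank is taken as ||W||_F^2 / sigma_max^2, and the class name, hinge-style penalty, and hyper-parameters (`target_sigma`, `lambda_spec`, `lambda_rank`) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrallyRegularizedRouter(nn.Module):
    """Linear MoE router with the two penalties described in the abstract:
    (1) a spectral-norm term that bounds the router's Lipschitz constant,
    (2) a stable-rank term that encourages high-dimensional routing features.
    The exact loss form and weights are assumptions, not the paper's."""

    def __init__(self, d_model: int, n_experts: int,
                 target_sigma: float = 1.0,
                 lambda_spec: float = 0.1, lambda_rank: float = 0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.target_sigma = target_sigma
        self.lambda_spec = lambda_spec
        self.lambda_rank = lambda_rank

    def regularization(self) -> torch.Tensor:
        W = self.gate.weight                             # (n_experts, d_model)
        sigma_max = torch.linalg.matrix_norm(W, ord=2)   # largest singular value
        stable_rank = (W ** 2).sum() / (sigma_max ** 2 + 1e-8)
        # Penalize spectral norm above the target (Lipschitz bound) and
        # reward a large stable rank (feature diversity in expert selection).
        spec_pen = F.relu(sigma_max - self.target_sigma) ** 2
        rank_pen = -stable_rank
        return self.lambda_spec * spec_pen + self.lambda_rank * rank_pen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.gate(x), dim=-1)           # routing probabilities
```

In training, `router.regularization()` would simply be added to the task loss; top-k selection and any load-balancing terms are omitted for brevity.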
Related papers
- Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization [10.669680236190432]
We propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency. We implement both losses as a drop-in Megatron-LM module.
arXiv Detail & Related papers (2026-02-15T14:19:12Z) - L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts [49.90176890917986]
We propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions. Experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
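From the summary alone, the two ingredients can be sketched: a shared low-rank latent routing space and a saturated inner-product score whose slope stays bounded. The projection rank and the tanh saturation below are assumptions, not the paper's definition of SIPS.

```python
import torch
import torch.nn as nn

class LowRankSaturatedRouter(nn.Module):
    """Sketch of L2R-style routing: tokens and experts meet in a shared
    low-rank latent space, and the inner-product score is saturated so the
    routing function's Lipschitz constant stays bounded (tanh is assumed)."""

    def __init__(self, d_model: int, n_experts: int, r: int = 16, s: float = 5.0):
        super().__init__()
        self.proj = nn.Linear(d_model, r, bias=False)             # low-rank routing space
        self.expert_emb = nn.Parameter(torch.randn(n_experts, r) / r ** 0.5)
        self.s = s                                                # saturation scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)                                          # (..., r)
        raw = z @ self.expert_emb.t()                             # inner-product scores
        scores = self.s * torch.tanh(raw / self.s)                # saturated, bounded slope
        return torch.softmax(scores, dim=-1)
```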
arXiv Detail & Related papers (2026-01-29T07:18:33Z) - Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
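The summary suggests a greedy "least-loaded" reassignment of tokens that overflow a capacity limit. A toy single-process sketch of that idea follows; the capacity rule, tie-breaking, and the omission of parameter migration across devices are all assumptions.

```python
from collections import Counter
from typing import List

def least_loaded_reassign(assignments: List[int], n_experts: int,
                          capacity: int) -> List[int]:
    """Toy sketch of least-loaded rerouting: tokens routed to an expert that is
    already at capacity are greedily moved to the currently least-loaded expert.
    Device placement and expert-parameter movement are omitted."""
    load = Counter()
    out = []
    for expert in assignments:
        if load[expert] >= capacity:
            # Reroute to whichever expert currently holds the fewest tokens.
            expert = min(range(n_experts), key=lambda e: load[e])
        load[expert] += 1
        out.append(expert)
    return out

# Example: expert 0 is oversubscribed, excess tokens spill to idle experts.
print(least_loaded_reassign([0, 0, 0, 0, 1], n_experts=4, capacity=2))
```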
arXiv Detail & Related papers (2026-01-23T18:19:15Z) - PhyG-MoE: A Physics-Guided Mixture-of-Experts Framework for Energy-Efficient GNSS Interference Recognition [49.955269674859004]
This paper introduces PhyG-MoE (Physics-Guided Mixture-of-Experts), a framework designed to align model capacity with signal complexity. Unlike static architectures, the proposed system employs a spectrum-based gating mechanism that routes signals based on their spectral feature entanglement. A high-capacity TransNeXt expert is activated on-demand to disentangle complex features in saturated scenarios, while lightweight experts handle fundamental signals to minimize latency.
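One plausible reading of "spectrum-based gating" is a cheap spectral statistic deciding when the heavy expert is needed. The sketch below uses spectral entropy of the FFT magnitude as a stand-in for "spectral feature entanglement"; that proxy, the threshold, and the hard gate are assumptions, not the paper's mechanism.

```python
import torch
import torch.nn as nn

class SpectrumGatedMoE(nn.Module):
    """Sketch of spectrum-based gating: a lightweight expert handles signals
    with a simple spectrum; a high-capacity expert is activated only when the
    normalized spectral entropy (an assumed proxy for spectral entanglement)
    exceeds a threshold."""

    def __init__(self, light: nn.Module, heavy: nn.Module, threshold: float = 0.8):
        super().__init__()
        self.light, self.heavy, self.threshold = light, heavy, threshold

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, samples) time-domain samples
        mag = torch.fft.rfft(signal, dim=-1).abs()
        p = mag / (mag.sum(dim=-1, keepdim=True) + 1e-8)
        entropy = -(p * (p + 1e-8).log()).sum(dim=-1)
        entropy = entropy / torch.log(torch.tensor(float(p.shape[-1])))  # normalize to [0, 1]
        use_heavy = entropy > self.threshold                             # per-sample hard gate
        out = self.light(signal)
        if use_heavy.any():
            out[use_heavy] = self.heavy(signal[use_heavy])
        return out
```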
arXiv Detail & Related papers (2026-01-19T07:57:52Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models. We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
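Treating each LoRA rank as an expert amounts to gating the rank-1 components of the low-rank update individually. A minimal sketch under that reading follows; the sigmoid router and the omission of degradation-aware conditioning are assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    """Sketch of a Mixture-of-Ranks adapter: the LoRA update B @ A is split into
    its rank-1 components and a router weights each rank (expert) per input.
    Degradation-aware routing from the paper is omitted."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.router = nn.Linear(base.in_features, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.router(x))     # (..., r), one gate per rank
        delta = (x @ self.A.t()) * gates           # gate each rank-1 expert
        return self.base(x) + delta @ self.B.t()
```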
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score". We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks. A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
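The summary does not define the "Eigenbasis Score"; one assumed reading, sketched below purely for illustration, scores each expert by how much of a token's norm is captured by projecting onto that expert's learned orthonormal basis.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class EigenbasisScoreGate(nn.Module):
    """Sketch of an eigenbasis-style gate (an assumed reading of ERMoE's
    'Eigenbasis Score'): each expert owns an orthonormal basis, and its score is
    the fraction of the token's norm that lies inside that subspace."""

    def __init__(self, d_model: int, n_experts: int, k: int = 16):
        super().__init__()
        self.bases = nn.ModuleList([
            orthogonal(nn.Linear(d_model, k, bias=False))   # rows span the expert's subspace
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = torch.stack(
            [basis(x).norm(dim=-1) / (x.norm(dim=-1) + 1e-8) for basis in self.bases],
            dim=-1,
        )                                                    # (..., n_experts), each in [0, 1]
        return scores / (scores.sum(dim=-1, keepdim=True) + 1e-8)
```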
arXiv Detail & Related papers (2025-11-14T05:31:37Z) - Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts [11.437368205968573]
This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts. We show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality.
arXiv Detail & Related papers (2025-10-08T16:40:31Z) - Towards Generalized Range-View LiDAR Segmentation in Adverse Weather [65.22588361803942]
We identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. We propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our approach significantly improves generalization to adverse weather with minimal inference overhead.
arXiv Detail & Related papers (2025-06-10T16:48:27Z) - Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate [18.379123927374042]
We propose ConfSMoE, which introduces a two-stage imputation module to handle the missing-modality problem in the SMoE architecture. Inspired by our theoretical analysis, ConfSMoE proposes a novel expert gating mechanism that detaches the softmax routing score and aligns it with a task confidence score with respect to the ground-truth signal.
arXiv Detail & Related papers (2025-05-26T05:18:55Z) - Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) for remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
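The factorization is not spelled out in the summary; a minimal sketch of one multilinear option, a CP-factorized expert weight tensor whose shared factors let all experts be evaluated densely at a cost that grows with the rank rather than the expert count, is given below under that assumption.

```python
import torch
import torch.nn as nn

class CPFactorizedMoE(nn.Module):
    """Sketch of a multilinear MoE layer with CP-factorized expert weights:
    W_e = sum_r C[e, r] * outer(V[:, r], U[:, r]).  All experts are applied
    implicitly through the shared factors U and V. Rank and gating are assumptions."""

    def __init__(self, d_in: int, d_out: int, n_experts: int, rank: int = 64):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_in, rank) * 0.02)       # shared input factor
        self.V = nn.Parameter(torch.randn(d_out, rank) * 0.02)      # shared output factor
        self.C = nn.Parameter(torch.randn(n_experts, rank) * 0.02)  # per-expert mixing
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)   # (..., E) soft expert weights
        h = x @ self.U                            # (..., R)
        mix = g @ self.C                          # (..., R): gate-weighted expert coefficients
        return (h * mix) @ self.V.t()             # equals sum_e g_e * (W_e @ x)
```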
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Efficient Deweather Mixture-of-Experts with Uncertainty-aware Feature-wise Linear Modulation [44.43376913419967]
We propose an efficient Mixture-of-Experts (MoE) architecture with weight sharing across experts.
MoFME implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block.
Experiments show that our MoFME outperforms the baselines in image restoration quality by 0.1-0.2 dB.
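A minimal sketch of the feature-wise linear modulation idea: a single shared expert block, with per-expert FiLM parameters (scale and shift) mixed by the gate. The block structure is an assumption, and the uncertainty-aware part of MoFME is omitted.

```python
import torch
import torch.nn as nn

class FiLMSharedExpertMoE(nn.Module):
    """Sketch of MoFME-style expert sharing: experts exist only as FiLM
    (scale, shift) modulations of one shared expert block, so parameters and
    compute barely grow with the number of experts. Uncertainty-aware gating
    from the paper is omitted."""

    def __init__(self, d: int, n_experts: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.gamma = nn.Parameter(torch.ones(n_experts, d))    # per-expert scales
        self.beta = nn.Parameter(torch.zeros(n_experts, d))    # per-expert shifts
        self.gate = nn.Linear(d, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)                # (..., E)
        gamma, beta = g @ self.gamma, g @ self.beta            # mix the modulations
        return gamma * self.shared(x) + beta                   # one shared forward pass
```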
arXiv Detail & Related papers (2023-12-27T15:23:37Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to activate experts conditionally.
However, as the number of experts grows, MoE models with enormous parameter counts suffer from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)