L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
- URL: http://arxiv.org/abs/2601.21349v1
- Date: Thu, 29 Jan 2026 07:18:33 GMT
- Title: L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
- Authors: Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama
- Abstract summary: We propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions. Experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
- Score: 49.90176890917986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
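The abstract's three ingredients (shared low-rank routing space, saturated inner-product scoring, multi-anchor experts) suggest a router of roughly the following shape. This is a minimal sketch under assumptions: the paper does not specify SIPS's exact form, so a scaled tanh stands in for the saturation, and all names (`LowRankRouter`, `d_latent`, `tau`) are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

class LowRankRouter(nn.Module):
    """Illustrative L2R-style router: tokens are projected into a shared
    low-rank latent space and scored against per-expert anchors with a
    saturated (bounded-slope) inner product."""

    def __init__(self, d_model: int, d_latent: int, n_experts: int,
                 n_anchors: int = 2, tau: float = 4.0):
        super().__init__()
        # Shared low-rank projection: routing happens in R^{d_latent},
        # with d_latent << d_model.
        self.proj = nn.Linear(d_model, d_latent, bias=False)
        # Multi-anchor routing: each expert owns several latent anchors.
        self.anchors = nn.Parameter(
            torch.randn(n_experts, n_anchors, d_latent) / d_latent ** 0.5)
        self.tau = tau  # saturation scale for the assumed SIPS stand-in

    def forward(self, x: torch.Tensor, top_k: int = 2):
        z = F.normalize(self.proj(x), dim=-1)    # (B, d_latent), unit norm
        a = F.normalize(self.anchors, dim=-1)    # (E, A, d_latent)
        raw = torch.einsum('bd,ead->bea', z, a)  # inner products, (B, E, A)
        # Saturating the score bounds both its magnitude (by tau) and its
        # slope (by 1), giving the Lipschitz control the abstract describes.
        sat = self.tau * torch.tanh(raw / self.tau)
        scores = sat.max(dim=-1).values          # best anchor per expert
        weights = F.softmax(scores, dim=-1)
        topv, topi = weights.topk(top_k, dim=-1)
        return topv / topv.sum(-1, keepdim=True), topi
```

For example, `LowRankRouter(d_model=1024, d_latent=64, n_experts=16)(torch.randn(8, 1024))` returns normalized top-2 weights and expert indices per token.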
Related papers
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
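As a toy illustration of the rerouting idea (the paper's actual device-level algorithm, which also migrates expert parameters, is not reproduced here), a greedy least-loaded fallback might look like this:

```python
import torch

def least_loaded_dispatch(expert_ids: torch.Tensor, n_experts: int,
                          capacity: int) -> torch.Tensor:
    """Toy rebalancer in the spirit of LLEP (details assumed): tokens that
    exceed their assigned expert's capacity are reassigned to the currently
    least-loaded expert instead of being dropped."""
    load = torch.zeros(n_experts, dtype=torch.long)
    out = expert_ids.clone()
    for t, e in enumerate(expert_ids.tolist()):
        if load[e] >= capacity:        # over capacity: fall back to the
            e = int(load.argmin())     # least-loaded expert
        out[t] = e
        load[e] += 1
    return out
```

For instance, `least_loaded_dispatch(torch.tensor([0, 0, 0, 0]), n_experts=3, capacity=2)` spills the third and fourth tokens to experts 1 and 2.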
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
- Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures [2.538209532048867]
Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity.
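The abstract only says SR-MoE imposes geometric constraints on the routing manifold; one plausible (assumed) instantiation is a penalty on the router matrix's largest singular value, sketched below. The actual SR-MoE regularizer may differ.

```python
import torch

def spectral_penalty(router_weight: torch.Tensor,
                     target: float = 1.0) -> torch.Tensor:
    """Penalize the router's spectral norm (largest singular value) so no
    single direction in the routing map dominates expert assignment.
    Illustrative stand-in for SR-MoE's geometric constraint."""
    sigma_max = torch.linalg.matrix_norm(router_weight, ord=2)
    return (sigma_max - target).clamp(min=0.0) ** 2
```

In training this would be added to the task loss, e.g. `loss = task_loss + lam * spectral_penalty(router.weight)`.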
arXiv Detail & Related papers (2026-01-07T12:59:37Z)
- Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation [52.11339614452127]
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z)
- DiSRouter: Distributed Self-Routing for LLM Selections [23.38983740640377]
We introduce DiSRouter (Distributed Self-Routing), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios.
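A bare-bones version of the answer-or-forward decision could be written as below; `confident` stands in for the self-awareness signal and is a hypothetical interface, not DiSRouter's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    name: str
    answer: Callable[[str], str]      # stub for the agent's own LLM call
    confident: Callable[[str], bool]  # self-awareness probe (assumed)
    peers: List["Agent"] = field(default_factory=list)

def self_route(query: str, agent: Agent, max_hops: int = 3) -> str:
    # Each agent decides independently: answer locally if it judges itself
    # competent (or the hop budget is spent), else forward to a peer.
    if agent.confident(query) or max_hops == 0 or not agent.peers:
        return agent.answer(query)
    return self_route(query, agent.peers[0], max_hops - 1)
```

The point of the distributed design is that no central router sees the whole network; each hop uses only local self-assessment.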
arXiv Detail & Related papers (2025-10-22T03:36:40Z)
- FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts [17.056585698418587]
Mixture of Experts (MoE) has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. A key limitation of existing MoE-LoRA methods is their reliance on a discrete router. We propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts.
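A rough sketch of the mergeability argument: if experts are LoRA updates combined by continuous weights (rather than a discrete router), the mixture collapses to a single dense matrix. The aggregation rule below is assumed for illustration, not FURINA's exact formulation.

```python
import torch

def aggregate_lora_experts(base_w: torch.Tensor, lora_As: list,
                           lora_Bs: list, mix: torch.Tensor) -> torch.Tensor:
    """Combine expert LoRA deltas B_i @ A_i with continuous mixing weights,
    then fold the result into the base weight, so the adapted model needs
    no router at inference time."""
    delta = sum(m * (B @ A) for m, A, B in zip(mix, lora_As, lora_Bs))
    return base_w + delta
```

With `base_w` of shape `(out, in)`, each `A` of shape `(r, in)` and `B` of shape `(out, r)`, the returned matrix is a drop-in replacement for `base_w`.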
arXiv Detail & Related papers (2025-09-18T12:22:32Z)
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
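The abstract's "sparse mixtures of linear projection experts" suggests replacing a single projection inside the SSM block with routed linear experts. A self-contained sketch follows; the wiring into actual Mamba internals is omitted and assumed.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ProjectionMoE(nn.Module):
    """Sparse mixture of linear projection experts, a stand-in for one
    projection layer in an SSM block (assumed structure, not RoM's code)."""

    def __init__(self, d_in: int, d_out: int, n_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Linear(d_in, d_out, bias=False) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)        # (B, E)
        topv, topi = gate.topk(self.top_k, dim=-1)      # (B, K)
        out = x.new_zeros(x.shape[0], self.experts[0].out_features)
        for k in range(self.top_k):                     # dispatch per slot
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    out[mask] += topv[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only `top_k` experts run per token, so parameter count scales with `n_experts` while per-token compute stays near that of a single projection.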
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
- RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing [27.481573948464987]
RadialRouter is a novel framework for large language model routing. It uses a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLM relationship. It significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios.
arXiv Detail & Related papers (2025-06-04T12:16:41Z)
- Performance Characterization of Expert Router for Scalable LLM Inference [0.4726677580049183]
Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains. However, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge. This paper introduces Expert Router, a scalable routing architecture that directs incoming queries to specialized expert models.
arXiv Detail & Related papers (2024-04-22T16:33:42Z)
- DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator Search [55.164053971213576]
Convolutional neural networks have achieved great success in computer vision tasks, despite large computation overhead. Structured (channel) pruning is usually applied to reduce model redundancy while preserving the network structure. Existing structured pruning methods require hand-crafted rules, which may lead to a tremendous pruning space.
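The "differentiable annealing indicator" in the title can be illustrated with a temperature-annealed sigmoid gate per channel; this is an assumed form, and DAIS's actual indicator and schedule may differ.

```python
import torch

def annealed_indicator(alpha: torch.Tensor, temperature: float) -> torch.Tensor:
    """Relax the binary keep/prune decision per channel: a learnable logit
    alpha passes through a sigmoid whose temperature is annealed toward 0,
    pushing soft indicators toward hard 0/1 choices."""
    return torch.sigmoid(alpha / temperature)
```

During training, each channel's output is scaled by its indicator, the temperature decays (e.g. multiplied by 0.98 per epoch), and channels whose indicator ends below 0.5 are pruned.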
arXiv Detail & Related papers (2020-11-04T07:43:01Z)
- Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent reflecting surface (IRS) has been employed to reshape wireless channels by controlling the phase shifts of individual scattering elements. Due to the large number of scattering elements, passive beamforming is typically challenged by high computational complexity. In this article, we focus on machine learning (ML) approaches for performance optimization in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)