Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
- URL: http://arxiv.org/abs/2603.04971v1
- Date: Thu, 05 Mar 2026 09:07:45 GMT
- Title: Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
- Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
- Abstract summary: Mixture-of-Experts (MoE) decouples model capacity from per-token computation. MoUE, an MoE generalization, introduces a novel scaling dimension: Virtual Width. MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes.
- Score: 49.44855760291454
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), an MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
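The core idea of the abstract can be illustrated with a minimal sketch: one layer-agnostic expert pool is reused at every layer, and a per-layer rotation staggers which experts each layer prefers. This is an illustrative toy, not the paper's implementation; the linear experts, the `np.roll`-based rotation, and all dimensions are assumptions standing in for the Staggered Rotational Topology and real FFN experts.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, N_LAYERS, TOP_K = 8, 4, 6, 2

# One universal, layer-agnostic expert pool shared by all layers.
# Each "expert" is a plain linear map here (real experts are FFNs).
expert_weights = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL)) * 0.1
router_weights = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moue_forward(x):
    """Reuse the same expert pool at every layer; rotating the router
    logits by the layer index biases successive layers toward different
    expert subsets (a stand-in for the Staggered Rotational Topology)."""
    for layer in range(N_LAYERS):
        logits = x @ router_weights
        # Stagger: layer-dependent rotation of the expert assignment.
        logits = np.roll(logits, shift=layer, axis=-1)
        probs = softmax(logits)
        top = np.argsort(probs, axis=-1)[..., -TOP_K:]  # top-k experts
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top[t]:
                out[t] += probs[t, e] * (x[t] @ expert_weights[e])
        x = x + out  # residual connection
    return x

tokens = rng.standard_normal((3, D_MODEL))
y = moue_forward(tokens)
print(y.shape)
```

Note how the per-token activation budget stays fixed (TOP_K experts per layer) while the same pool is exposed N_LAYERS times, which is the sense in which depth is converted into "virtual width".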
Related papers
- M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction [47.507960245579106]
We propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features.
arXiv Detail & Related papers (2026-01-13T07:33:38Z) - ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z) - Hierarchical LoRA MoE for Efficient CTR Model Scaling [56.608809143548946]
HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel.
arXiv Detail & Related papers (2025-10-12T03:54:11Z) - MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction [32.14335364083271]
We present Multi-Baseline Gaussian Splatting (MuGS), a feed-forward approach for novel view synthesis. MuGS effectively handles diverse baseline settings, including sparse input views with both small and large baselines. We demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets.
arXiv Detail & Related papers (2025-08-06T10:34:24Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities. We propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - How Lightweight Can A Vision Transformer Be [0.0]
We explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers.
Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer.
We found that the architecture is competitive even at a size of 0.67M parameters.
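The shared-weight SwiGLU expert described above can be sketched in a few lines: only the gate projection W1 is expert-specific, while V and W2 are shared across all experts in the layer. This is a hedged toy reconstruction from the summary alone; the dimensions and weight initializations are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_EXPERTS = 8, 16, 4

# Shared across all experts in the layer (per the summary above):
V  = rng.standard_normal((H, D)) * 0.1   # value-path projection
W2 = rng.standard_normal((D, H)) * 0.1   # output projection
# Per-expert gate projections: the only expert-specific parameters here.
W1 = rng.standard_normal((N_EXPERTS, H, D)) * 0.1

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, e):
    """SwiGLU FFN: out = W2 (SiLU(W1_e x) * (V x)), with the gate from
    the expert-specific W1 and the rest from layer-shared V and W2."""
    return W2 @ (silu(W1[e] @ x) * (V @ x))

x = rng.standard_normal(D)
outs = [swiglu_expert(x, e) for e in range(N_EXPERTS)]
print(len(outs), outs[0].shape)
```

Sharing V and W2 means adding an expert costs only one extra W1 matrix, which is how the parameter count can stay as small as the ~0.67M figure quoted above.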
arXiv Detail & Related papers (2024-07-25T05:23:20Z) - Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts [88.23732496104667]
Cross-scene generalizable NeRF models have become a new spotlight of the NeRF field.
We bridge "neuralized" architectures with the powerful Mixture-of-Experts (MoE) idea from large language models.
Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes.
arXiv Detail & Related papers (2023-08-22T21:18:54Z) - Non-local Recurrent Regularization Networks for Multi-view Stereo [108.17325696835542]
In deep multi-view stereo networks, cost regularization is crucial to achieve accurate depth estimation.
We propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net.
Our method achieves state-of-the-art reconstruction results on both DTU and Tanks and Temples datasets.
arXiv Detail & Related papers (2021-10-13T01:43:54Z) - Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrates multiple scale-associated information.
Our approach can ignore various noises selectively and focus on appropriate crowd scales automatically.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.