Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
- URL: http://arxiv.org/abs/2603.04971v1
- Date: Thu, 05 Mar 2026 09:07:45 GMT
- Title: Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
- Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
- Abstract summary: Mixture-of-Experts (MoE) decouples model capacity from per-token computation. MoUE, an MoE generalization, introduces a novel scaling dimension: Virtual Width. MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes.
- Score: 49.44855760291454
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), an MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal, layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
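The core idea of the abstract can be illustrated with a minimal sketch: one layer-agnostic expert pool is reused at every layer, and a per-layer rotation staggers which experts each layer prefers. This is an illustrative toy, not the paper's implementation; the linear experts, the `np.roll`-based rotation, and all dimensions are assumptions standing in for the Staggered Rotational Topology and real FFN experts.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_EXPERTS, N_LAYERS, TOP_K = 8, 4, 6, 2

# One universal, layer-agnostic expert pool shared by all layers.
# Each "expert" is a plain linear map here (real experts are FFNs).
expert_weights = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL)) * 0.1
router_weights = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moue_forward(x):
    """Reuse the same expert pool at every layer; rotating the router
    logits by the layer index biases successive layers toward different
    expert subsets (a stand-in for the Staggered Rotational Topology)."""
    for layer in range(N_LAYERS):
        logits = x @ router_weights
        # Stagger: layer-dependent rotation of the expert assignment.
        logits = np.roll(logits, shift=layer, axis=-1)
        probs = softmax(logits)
        top = np.argsort(probs, axis=-1)[..., -TOP_K:]  # top-k experts
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top[t]:
                out[t] += probs[t, e] * (x[t] @ expert_weights[e])
        x = x + out  # residual connection
    return x

tokens = rng.standard_normal((3, D_MODEL))
y = moue_forward(tokens)
print(y.shape)
```

Note how the per-token activation budget stays fixed (TOP_K experts per layer) while the same pool is exposed N_LAYERS times, which is the sense in which depth is converted into "virtual width".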
Related papers
- M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction [47.507960245579106]
We propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features.
arXiv Detail & Related papers (2026-01-13T07:33:38Z) - ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z) - Hierarchical LoRA MoE for Efficient CTR Model Scaling [56.608809143548946]
HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel.
arXiv Detail & Related papers (2025-10-12T03:54:11Z) - MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction [32.14335364083271]
We present Multi-Baseline Gaussian Splatting (MuGS), a feed-forward approach for novel view synthesis. MuGS effectively handles diverse baseline settings, including sparse input views with both small and large baselines. We demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets.
arXiv Detail & Related papers (2025-08-06T10:34:24Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities. We propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - How Lightweight Can A Vision Transformer Be [0.0]
We explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers.
Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer.
We found that the architecture is competitive even at a size of 0.67M parameters.
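The shared-weight SwiGLU expert described above can be sketched in a few lines: only the gate projection W1 is expert-specific, while V and W2 are shared across all experts in the layer. This is a hedged toy reconstruction from the summary alone; the dimensions and weight initializations are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_EXPERTS = 8, 16, 4

# Shared across all experts in the layer (per the summary above):
V  = rng.standard_normal((H, D)) * 0.1   # value-path projection
W2 = rng.standard_normal((D, H)) * 0.1   # output projection
# Per-expert gate projections: the only expert-specific parameters here.
W1 = rng.standard_normal((N_EXPERTS, H, D)) * 0.1

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_expert(x, e):
    """SwiGLU FFN: out = W2 (SiLU(W1_e x) * (V x)), with the gate from
    the expert-specific W1 and the rest from layer-shared V and W2."""
    return W2 @ (silu(W1[e] @ x) * (V @ x))

x = rng.standard_normal(D)
outs = [swiglu_expert(x, e) for e in range(N_EXPERTS)]
print(len(outs), outs[0].shape)
```

Sharing V and W2 means adding an expert costs only one extra W1 matrix, which is how the parameter count can stay as small as the ~0.67M figure quoted above.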
arXiv Detail & Related papers (2024-07-25T05:23:20Z) - Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts [88.23732496104667]
Cross-scene generalizable NeRF models have become a new spotlight of the NeRF field.
We bridge "neuralized" architectures with the powerful Mixture-of-Experts (MoE) idea from large language models.
Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes.
arXiv Detail & Related papers (2023-08-22T21:18:54Z) - Non-local Recurrent Regularization Networks for Multi-view Stereo [108.17325696835542]
In deep multi-view stereo networks, cost regularization is crucial to achieve accurate depth estimation.
We propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net.
Our method achieves state-of-the-art reconstruction results on both DTU and Tanks and Temples datasets.
arXiv Detail & Related papers (2021-10-13T01:43:54Z) - Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrates multiple scale-associated information.
Our approach can ignore various noises selectively and focus on appropriate crowd scales automatically.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.