Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
- URL: http://arxiv.org/abs/2507.05724v2
- Date: Mon, 21 Jul 2025 18:32:18 GMT
- Title: Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
- Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
- Abstract summary: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers.
- Score: 12.160284873788019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
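The contrast between per-layer routing and a shared router can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the gating function, the number of experts, or how the router is trained, so the top-1 selection, the dimensions, and all function names below (`route_per_layer`, `route_shared`, and the helpers) are assumptions made for illustration only. As a further simplification, the toy example feeds the same hidden vector to every layer, whereas in a real model the hidden state changes from layer to layer.

```python
import random

def matvec(weights, x):
    """Multiply a (num_experts x dim) matrix by a dim-sized vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def top1(logits):
    """Index of the largest logit (top-1 expert choice, as in Switch-style routing)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def make_matrix(rows, cols, rng):
    """Random router weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def route_per_layer(hiddens, routers):
    """Independent routing: layer l uses its own router matrix routers[l]."""
    return [top1(matvec(w, h)) for w, h in zip(routers, hiddens)]

def route_shared(hiddens, shared_router):
    """Shared routing: one router matrix is reused at every MoE layer."""
    return [top1(matvec(shared_router, h)) for h in hiddens]

# Hypothetical toy sizes, chosen only for the illustration.
rng = random.Random(0)
dim, num_experts, num_layers = 8, 4, 6

# Simplifying assumption: the same hidden vector reaches every layer.
token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
hiddens = [token] * num_layers

per_layer_routers = [make_matrix(num_experts, dim, rng) for _ in range(num_layers)]
shared_router = make_matrix(num_experts, dim, rng)

per_layer_choices = route_per_layer(hiddens, per_layer_routers)
shared_choices = route_shared(hiddens, shared_router)

print("per-layer routers:", per_layer_choices)
print("shared router:    ", shared_choices)
```

Under the identical-hidden-state assumption, the shared router picks the same expert index at every layer, while independent routers have no such coupling. In a real network the hidden states differ across layers, so a shared router would tie the decisions together only to the extent that those states remain similar.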
Related papers
- RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. For attribution and tampering detection, we introduce a similarity-based matching algorithm.
arXiv Detail & Related papers (2025-08-03T14:51:58Z) - Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models [58.54288496296157]
Chain-of-Experts (CoE) is a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each step within a layer.
arXiv Detail & Related papers (2025-06-23T02:15:43Z) - Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models [0.0]
Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that moves tokens to specialized experts. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
arXiv Detail & Related papers (2025-06-19T15:55:43Z) - Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning [60.84901522792042]
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs). We propose R1, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. R1- can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
arXiv Detail & Related papers (2025-05-28T08:17:57Z) - Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization [51.562474873972086]
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. We propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG.
arXiv Detail & Related papers (2025-04-29T11:06:03Z) - ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.73697327316267]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
arXiv Detail & Related papers (2024-12-19T10:21:20Z) - Routers in Vision Mixture of Experts: An Empirical Study [26.51711534240885]
Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost.
A key component of MoEs is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens).
arXiv Detail & Related papers (2024-01-29T08:58:07Z) - Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR).
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
arXiv Detail & Related papers (2023-06-06T15:04:31Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z) - SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [29.582683923988203]
Mixture of Experts (MoE) based Transformers have shown promising results in many domains.
In this work, we explore an MoE-based model for speech recognition, named SpeechMoE.
A new router architecture is used in SpeechMoE, which can simultaneously utilize the information from a shared embedding network.
arXiv Detail & Related papers (2021-05-07T02:38:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.