Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models
- URL: http://arxiv.org/abs/2506.16419v1
- Date: Thu, 19 Jun 2025 15:55:43 GMT
- Title: Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models
- Authors: Daniel Fidel Harvey, George Weale, Berk Yilmaz
- Abstract summary: Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that routes tokens to specialized experts. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that routes tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to address these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, examining parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully replaced and fine-tuned custom routers within the complex, quantized Qwen1.5-MoE model. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
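The Linear variant is the simplest of the six: one dot product per expert followed by TopK+Softmax gating. A minimal plain-Python sketch of that gating is shown below; the function name, toy weights, and shapes are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear_router(token, weights, top_k=2):
    """Route one token (a feature vector) to its top-k experts.

    weights: one row of router weights per expert (num_experts x dim).
    Returns (expert_indices, gate_values) with the gates renormalized
    over the selected experts, as in standard TopK+Softmax routing.
    """
    # Linear scoring: one dot product per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    probs = softmax(logits)
    # Keep only the top-k experts and renormalize their gates.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return top, [probs[i] / total for i in top]

# Toy example: 3 experts, 2-dimensional tokens.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
experts, gates = linear_router([2.0, 1.0], W, top_k=2)
```

The more expressive variants (Attention, MLP, MLP-Hadamard) differ only in how the per-expert logits are computed; the TopK+Softmax stage stays the same.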
Related papers
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition [12.160284873788019]
Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers.
arXiv Detail & Related papers (2025-07-08T07:18:33Z)
- Mixture of Routers [4.248666380057258]
We propose an efficient fine-tuning method called Mixture of Routers (MoR). MoR uses multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. Results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%.
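One plausible reading of the MoR mechanism is that the main router's softmax weights blend the per-expert scores produced by the sub-routers before an expert is selected. The sketch below is a hedged guess at that idea; every name, shape, and the blending rule itself are assumptions, not the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_of_routers(token, sub_routers, main_logits):
    """Blend several sub-routers' expert scores using the main router's
    softmax weights, then pick the best-scoring expert (illustrative
    sketch of the MoR idea, not the authors' implementation).

    sub_routers: list of weight matrices, one per sub-router,
                 each num_experts x dim.
    main_logits: the main router's per-sub-router logits.
    """
    weights = softmax(main_logits)
    num_experts = len(sub_routers[0])
    blended = [0.0] * num_experts
    for w, rows in zip(weights, sub_routers):
        for e, row in enumerate(rows):
            score = sum(a * b for a, b in zip(row, token))
            blended[e] += w * score
    best = max(range(num_experts), key=lambda e: blended[e])
    return best, blended

# Toy example: two sub-routers, two experts, 2-dim token.
# The main router strongly favors the first sub-router.
sub1 = [[1.0, 0.0], [0.0, 1.0]]
sub2 = [[0.0, 1.0], [1.0, 0.0]]
best, blended = mixture_of_routers([1.0, 0.0], [sub1, sub2], [2.0, 0.0])
```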
arXiv Detail & Related papers (2025-03-30T08:39:09Z)
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.73697327316267]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
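A minimal sketch of the ReLU-routing idea: the gate for each expert is simply `max(0, logit)`, so sparsity emerges per token wherever logits go non-positive instead of being fixed at k, and the gate stays differentiable where it is active. The function name and toy weights below are illustrative, not ReMoE's code.

```python
def relu_router(token, weights):
    """ReLU gating in the spirit of ReMoE: experts with non-positive
    scores are dropped, so the number of active experts varies per
    token. The gate values are the raw ReLU outputs.

    weights: one row of router weights per expert (num_experts x dim).
    Returns (active_expert_indices, all_gate_values).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    gates = [max(0.0, z) for z in logits]
    active = [i for i, g in enumerate(gates) if g > 0.0]
    return active, gates

# Toy example: this token scores negatively for expert 1,
# so only experts 0 and 2 are activated.
W = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
active, gates = relu_router([2.0, 1.0], W)
```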
arXiv Detail & Related papers (2024-12-19T10:21:20Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router's l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
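The pruning criterion above can be illustrated with a short sketch that ranks experts by the l2 change of their router rows between the pretrained and fine-tuned checkpoints, pruning the least-changed experts first. This is an assumption-laden toy, not the authors' implementation.

```python
import math

def prune_experts(pre_router, fine_router, keep):
    """Keep the `keep` experts whose router weight rows changed the most
    between pretraining and fine-tuning; experts whose rows barely moved
    are pruned first (illustrative sketch of the paper's criterion).

    pre_router, fine_router: num_experts x dim router weight matrices.
    Returns the sorted indices of the kept experts.
    """
    changes = []
    for idx, (p, f) in enumerate(zip(pre_router, fine_router)):
        # l2 norm of the change in this expert's router row.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, f)))
        changes.append((delta, idx))
    changes.sort(reverse=True)  # largest change first
    return sorted(idx for _, idx in changes[:keep])

# Toy example: expert 2's router row did not move at all, so it is
# the first to be pruned when keeping two of three experts.
pre  = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[1.0, 0.1], [0.5, 0.5], [1.0, 1.0]]
kept = prune_experts(pre, fine, keep=2)
```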
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- Performance Characterization of Expert Router for Scalable LLM Inference [0.4726677580049183]
Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains.
However, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge.
This paper introduces Expert Router, a scalable routing architecture that directs requests to specialized expert models.
arXiv Detail & Related papers (2024-04-22T16:33:42Z)
- Routers in Vision Mixture of Experts: An Empirical Study [26.51711534240885]
Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost.
A key component of MoEs is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens).
arXiv Detail & Related papers (2024-01-29T08:58:07Z)
- Robust Mixture-of-Expert Training for Convolutional Neural Networks [141.3531209949845]
Sparsely-gated Mixture of Expert (MoE) has demonstrated great promise in enabling high-accuracy and ultra-efficient model inference.
We propose a new router-expert alternating Adversarial training framework for MoE, termed AdvMoE.
We find that AdvMoE achieves a 1% to 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE.
arXiv Detail & Related papers (2023-08-19T20:58:21Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
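Bi-level routing can be pictured as a two-stage selection: first pick a node (to exploit fast intra-node bandwidth), then pick an expert hosted on that node. The sketch below is a simplified illustration of that two-stage idea; SMILE's actual scheme, and all names and shapes here, are assumptions.

```python
def bilevel_route(node_scores, expert_scores, experts_per_node):
    """Two-stage routing in the spirit of SMILE's bi-level scheme:
    stage 1 selects a node, stage 2 selects an expert on that node.

    node_scores: one router score per node.
    expert_scores: per-node lists of scores for the experts it hosts.
    Returns (node_index, global_expert_index).
    """
    # Stage 1: pick the highest-scoring node.
    node = max(range(len(node_scores)), key=lambda i: node_scores[i])
    # Stage 2: only the experts hosted on that node are considered.
    local = expert_scores[node]
    expert = max(range(len(local)), key=lambda i: local[i])
    return node, node * experts_per_node + expert

# Toy example: 2 nodes, 2 experts each. Node 1 wins stage 1, and its
# first local expert (global index 2) wins stage 2.
node, expert_id = bilevel_route([0.2, 0.9], [[0.1, 0.3], [0.8, 0.2]], 2)
```

Splitting the decision this way keeps most traffic inside a node, which is the source of the pretraining-throughput speedup the paper reports.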
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed as SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.