ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- URL: http://arxiv.org/abs/2412.14711v2
- Date: Thu, 27 Feb 2025 16:33:09 GMT
- Title: ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
- Authors: Ziteng Wang, Jun Zhu, Jianfei Chen
- Abstract summary: Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
- Score: 28.73697327316267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
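To make the routing change concrete, below is a minimal PyTorch sketch contrasting a conventional TopK+Softmax router with a ReLU router. It illustrates the idea stated in the abstract rather than the Megatron-LM implementation in the linked repository; in particular, the `l1_coef` penalty is a simplified stand-in for the paper's sparsity and load-balancing regularization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Conventional TopK+Softmax routing: each token is hard-assigned to
    exactly k experts, so the selection step is discontinuous."""
    def __init__(self, hidden_size: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.w_gate(x)                                # [tokens, experts]
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return gates                                           # exactly k nonzeros per token

class ReLURouter(nn.Module):
    """ReLU routing: the gate is ReLU(W x), which is continuous and
    differentiable while still producing exact zeros, so the number of
    active experts per token is dynamic. The L1 term is a simplified
    stand-in (an assumption) for ReMoE's sparsity/load-balancing control."""
    def __init__(self, hidden_size: int, num_experts: int, l1_coef: float = 1e-2):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.l1_coef = l1_coef

    def forward(self, x: torch.Tensor):
        gates = F.relu(self.w_gate(x))                         # sparse but differentiable
        sparsity_loss = self.l1_coef * gates.mean()            # pushes gate values toward zero
        return gates, sparsity_loss

# Both routers expose per-token, per-expert gate weights to the MoE layer.
x = torch.randn(8, 64)                                         # 8 tokens, hidden size 64
topk_gates = TopKRouter(64, num_experts=8)(x)
relu_gates, reg = ReLURouter(64, num_experts=8)(x)
```

Because ReLU produces exact zeros while remaining differentiable, the number of active experts per token can vary, which is what enables the dynamic allocation of computation across tokens and layers mentioned in the abstract.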
Related papers
- Mixture of Group Experts for Learning Invariant Representations [25.935653652324532]
Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token.
We present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation.
We propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE)
arXiv Detail & Related papers (2025-04-12T15:58:02Z)
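As a rough illustration of the group-sparse regularization mentioned in the MoGE summary above, the sketch below applies a standard group-lasso penalty (sum of per-group L2 norms) to the router input. The even split of features into groups and where the penalty enters the training loss are assumptions made for this sketch, not details from the paper.

```python
import torch

def group_lasso_penalty(router_input: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Standard group lasso (sum of per-group L2 norms) over the feature
    dimension of the router input; an even split into `num_groups` groups
    is assumed for this sketch."""
    tokens, hidden = router_input.shape
    assert hidden % num_groups == 0, "hidden size must divide evenly into groups"
    groups = router_input.view(tokens, num_groups, hidden // num_groups)
    return groups.norm(p=2, dim=-1).sum(dim=-1).mean()  # drives whole feature groups toward zero

x = torch.randn(16, 128)                        # router input for 16 tokens
penalty = group_lasso_penalty(x, num_groups=8)  # would be added to the training loss
```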
- Mixture of Routers [4.248666380057258]
We propose an efficient fine-tuning method called Mixture of Routers (MoR)
MoR uses multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers.
Results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%.
arXiv Detail & Related papers (2025-03-30T08:39:09Z)
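A minimal sketch of the MoR routing scheme as summarized above: several sub-routers score the experts and a learnable main router weights the sub-routers before the usual top-k selection. The weighted-sum combination and the softmax over sub-router weights are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfRoutersSketch(nn.Module):
    """Several sub-routers score the experts; a learnable main router
    weights the sub-routers (assumed weighted sum) before top-k selection."""
    def __init__(self, hidden_size: int, num_experts: int, num_sub_routers: int = 4, k: int = 2):
        super().__init__()
        self.sub_routers = nn.ModuleList(
            [nn.Linear(hidden_size, num_experts, bias=False) for _ in range(num_sub_routers)]
        )
        self.main_router = nn.Linear(hidden_size, num_sub_routers, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        sub_weights = F.softmax(self.main_router(x), dim=-1)            # [tokens, sub_routers]
        sub_logits = torch.stack([r(x) for r in self.sub_routers], 1)   # [tokens, sub_routers, experts]
        logits = (sub_weights.unsqueeze(-1) * sub_logits).sum(dim=1)    # joint expert scores
        scores, experts = logits.topk(self.k, dim=-1)
        return F.softmax(scores, dim=-1), experts                       # gate weights, chosen experts

gates, chosen = MixtureOfRoutersSketch(64, num_experts=8)(torch.randn(4, 64))
```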
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token.
We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
arXiv Detail & Related papers (2024-08-13T10:25:13Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change in the router's $l_2$ norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
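The pruning criterion stated above translates into a short sketch: rank experts by how much their router weights moved from the pretrained model (the $l_2$ norm of the change) and prune the ones that moved least. The [num_experts, hidden_size] weight layout is assumed for illustration.

```python
import torch

def experts_to_prune(router_pretrained: torch.Tensor,
                     router_finetuned: torch.Tensor,
                     num_to_prune: int) -> torch.Tensor:
    """Return indices of the experts whose router weight rows changed least
    (smallest l2 norm of the difference) from pre-training to fine-tuning.
    Weight layout [num_experts, hidden_size] is assumed for this sketch."""
    change = (router_finetuned - router_pretrained).norm(p=2, dim=-1)  # [num_experts]
    return change.argsort()[:num_to_prune]                             # smallest change first

pre = torch.randn(8, 64)
post = pre + 0.01 * torch.randn(8, 64)
print(experts_to_prune(pre, post, num_to_prune=2))
```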
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
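The SCMoE summary above does not spell out the contrast mechanism, so the sketch below assumes one plausible form in the spirit of contrastive decoding: combine the next-token logits from the usual top-k routing with logits from a pass routed through otherwise-unchosen experts.

```python
import torch

def self_contrast_logits(strong_logits: torch.Tensor,
                         weak_logits: torch.Tensor,
                         beta: float = 0.5) -> torch.Tensor:
    """Combine two next-token logit vectors: `strong_logits` from the usual
    top-k routing and `weak_logits` from a pass routed to otherwise-unchosen
    experts. The linear extrapolation below is a common contrastive-decoding
    form, used here only as an assumed illustration of self-contrast."""
    return (1.0 + beta) * strong_logits - beta * weak_logits

vocab_size = 32000
adjusted = self_contrast_logits(torch.randn(vocab_size), torch.randn(vocab_size))
next_token = adjusted.argmax()   # greedy pick from the contrasted distribution
```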
- Routers in Vision Mixture of Experts: An Empirical Study [26.51711534240885]
Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost.
A key component of MoEs is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens).
arXiv Detail & Related papers (2024-01-29T08:58:07Z)
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
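A rough sketch of what bi-level routing could look like given SMILE's stated motivation (heterogeneous network bandwidth): pick a node first, then an expert within that node, so most dispatch stays on faster intra-node links. The two-gate decomposition and greedy argmax selection are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class BiLevelRouterSketch(nn.Module):
    """Level 1 picks a node (a group of experts); level 2 picks an expert
    inside that node, keeping dispatch on faster intra-node links.
    The two-gate decomposition is an assumption for this sketch."""
    def __init__(self, hidden_size: int, num_nodes: int, experts_per_node: int):
        super().__init__()
        self.node_gate = nn.Linear(hidden_size, num_nodes, bias=False)
        self.expert_gate = nn.Linear(hidden_size, num_nodes * experts_per_node, bias=False)
        self.num_nodes = num_nodes
        self.experts_per_node = experts_per_node

    def forward(self, x: torch.Tensor):
        node = self.node_gate(x).argmax(dim=-1)                             # [tokens], chosen node
        expert_logits = self.expert_gate(x).view(-1, self.num_nodes, self.experts_per_node)
        local_logits = expert_logits[torch.arange(x.size(0)), node]         # experts on the chosen node
        local_expert = local_logits.argmax(dim=-1)
        return node, node * self.experts_per_node + local_expert            # node id, flat expert id

nodes, experts = BiLevelRouterSketch(64, num_nodes=4, experts_per_node=2)(torch.randn(8, 64))
```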
- AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z)
- SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [29.582683923988203]
The Mixture of Experts (MoE) based Transformer has shown promising results in many domains.
In this work, we explore the MoE based model for speech recognition, named SpeechMoE.
A new router architecture is used in SpeechMoE, which can simultaneously utilize information from a shared embedding network.
arXiv Detail & Related papers (2021-05-07T02:38:23Z)