Layerwise Recurrent Router for Mixture-of-Experts
- URL: http://arxiv.org/abs/2408.06793v1
- Date: Tue, 13 Aug 2024 10:25:13 GMT
- Title: Layerwise Recurrent Router for Mixture-of-Experts
- Authors: Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu,
- Abstract summary: Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
- Score: 42.36093735411238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters. Being a crucial part of MoE, current routers in different layers independently assign tokens without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be efficiently parallelly computed for input tokens and introduces negotiable costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE
Related papers
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Ada-K Routing: Boosting the Efficiency of MoE-based LLMs [6.954735360168147]
We propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token.
Our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs for each token.
arXiv Detail & Related papers (2024-10-14T12:50:04Z) - Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z) - XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
tool is a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models.
tool can enhance model performance while decreasing the computation load at MoE layers by over 50% without sacrificing performance.
arXiv Detail & Related papers (2024-02-27T08:18:02Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.