StableMoE: Stable Routing Strategy for Mixture of Experts
- URL: http://arxiv.org/abs/2204.08396v1
- Date: Mon, 18 Apr 2022 16:48:19 GMT
- Title: StableMoE: Stable Routing Strategy for Mixture of Experts
- Authors: Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang,
Furu Wei
- Abstract summary: The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
- Score: 109.0602120199226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) technique can scale up the model size of
Transformers with an affordable computational overhead. We point out that
existing learning-to-route MoE methods suffer from the routing fluctuation
issue, i.e., the target expert of the same input may change along with
training, but only one expert will be activated for the input during inference.
The routing fluctuation tends to harm sample efficiency because the same input
updates different experts but only one is finally used. In this paper, we
propose StableMoE with two training stages to address the routing fluctuation
problem. In the first training stage, we learn a balanced and cohesive routing
strategy and distill it into a lightweight router decoupled from the backbone
model. In the second training stage, we utilize the distilled router to
determine the token-to-expert assignment and freeze it for a stable routing
strategy. We validate our method on language modeling and multilingual machine
translation. The results show that StableMoE outperforms existing MoE methods
in terms of both convergence speed and performance.
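The routing fluctuation issue and the effect of freezing the router can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says the distilled router is lightweight and decoupled from the backbone, so the dot-product scoring against per-expert centroids, the random perturbation standing in for a training update, and all names (`route`, `fluctuation_rate`, `token_emb`, `centroids`) are assumptions made for illustration.

```python
import random


def route(embedding, centroids):
    """Return the index of the expert whose centroid scores highest.

    Assumption: the router scores a token by the dot product between its
    embedding and one centroid vector per expert.
    """
    scores = [sum(e * c for e, c in zip(embedding, centroid))
              for centroid in centroids]
    return max(range(len(scores)), key=scores.__getitem__)


def fluctuation_rate(prev_assign, curr_assign):
    """Fraction of tokens whose target expert differs between two checkpoints."""
    changed = sum(p != c for p, c in zip(prev_assign, curr_assign))
    return changed / len(prev_assign)


rng = random.Random(0)
dim, num_experts, vocab_size = 4, 3, 10

# Hypothetical toy data: one embedding per token type, one centroid per expert.
token_emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
centroids = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

# Learned routing: the router parameters keep changing during training, so
# the same token may be sent to a different expert after an update.
before = [route(e, centroids) for e in token_emb]
updated = [[c + rng.gauss(0, 0.5) for c in centroid] for centroid in centroids]
after = [route(e, updated) for e in token_emb]
learned_rate = fluctuation_rate(before, after)

# Frozen routing (second stage): the router is fixed, so the
# token-to-expert assignment is identical at every later step.
frozen = [route(e, centroids) for e in token_emb]
later = [route(e, centroids) for e in token_emb]
frozen_rate = fluctuation_rate(frozen, later)

print("fluctuation with a changing router:", learned_rate)
print("fluctuation with a frozen router:", frozen_rate)
```

With a frozen router the fluctuation rate is zero by construction, so every update for a given token goes to the expert that will also serve it at inference time.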
Related papers
- iDAT: inverse Distillation Adapter-Tuning [15.485126287621439]
The Adapter-Tuning (AT) method involves freezing a pre-trained model and introducing trainable adapter modules to acquire downstream knowledge.
This paper proposes a distillation framework for the AT method instead of crafting a carefully designed adapter module.
arXiv Detail & Related papers (2024-03-23T07:36:58Z)
- Routers in Vision Mixture of Experts: An Empirical Study [26.51711534240885]
Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost.
A key component of MoEs is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens).
arXiv Detail & Related papers (2024-01-29T08:58:07Z)
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
- Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gain.
arXiv Detail & Related papers (2023-10-01T22:43:57Z)
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
- Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
arXiv Detail & Related papers (2021-06-08T14:54:24Z)