StableMoE: Stable Routing Strategy for Mixture of Experts
- URL: http://arxiv.org/abs/2204.08396v1
- Date: Mon, 18 Apr 2022 16:48:19 GMT
- Title: StableMoE: Stable Routing Strategy for Mixture of Experts
- Authors: Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang,
Furu Wei
- Abstract summary: The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
- Score: 109.0602120199226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) technique can scale up the model size of
Transformers with an affordable computational overhead. We point out that
existing learning-to-route MoE methods suffer from the routing fluctuation
issue, i.e., the target expert of the same input may change along with
training, but only one expert will be activated for the input during inference.
The routing fluctuation tends to harm sample efficiency because the same input
updates different experts but only one is finally used. In this paper, we
propose StableMoE with two training stages to address the routing fluctuation
problem. In the first training stage, we learn a balanced and cohesive routing
strategy and distill it into a lightweight router decoupled from the backbone
model. In the second training stage, we utilize the distilled router to
determine the token-to-expert assignment and freeze it for a stable routing
strategy. We validate our method on language modeling and multilingual machine
translation. The results show that StableMoE outperforms existing MoE methods
in terms of both convergence speed and performance.
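To make the two-stage recipe concrete, here is a minimal sketch in PyTorch-style Python. It is not the authors' implementation: the embedding-table router, the argmax assignment, and the distillation-loss weight `alpha` are assumptions chosen for readability.

```python
# Minimal sketch of the two-stage StableMoE idea (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightRouter(nn.Module):
    """Decoupled router: maps token ids directly to expert logits."""
    def __init__(self, vocab_size: int, num_experts: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, num_experts)

    def forward(self, token_ids):                 # (batch, seq) -> (batch, seq, E)
        return self.table(token_ids)

def stage1_step(token_ids, hidden, moe_router, light_router, alpha=0.1):
    """Stage 1: learn routing with the backbone router and distill it."""
    teacher_logits = moe_router(hidden)           # learning-to-route logits
    student_logits = light_router(token_ids)      # lightweight, decoupled router
    # Distillation: the lightweight router imitates the learned assignment.
    distill_loss = F.cross_entropy(
        student_logits.flatten(0, 1),
        teacher_logits.argmax(dim=-1).flatten(),
    )
    assignment = teacher_logits.argmax(dim=-1)    # experts actually used this step
    return assignment, alpha * distill_loss

def stage2_assign(token_ids, light_router):
    """Stage 2: the distilled router is frozen, so routing no longer fluctuates."""
    with torch.no_grad():
        return light_router(token_ids).argmax(dim=-1)
```

In stage 1 the backbone router still decides the assignment while the lightweight router learns to imitate it; in stage 2 only the frozen lightweight router is consulted, so a token's target expert no longer changes during training.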
Related papers
- Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [40.40923861822689]
Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two main challenges.
We propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training.
For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency.
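A minimal sketch of the router-only fine-tuning idea, assuming router parameters can be identified by name and that the model exposes a Hugging-Face-style `model(**batch).loss` interface; both are illustrative conventions, not details from the paper.

```python
# Sketch: freeze everything except the router, then fine-tune on a small dataset.
import torch

def freeze_all_but_router(model: torch.nn.Module):
    # Assumes router parameters contain "router" in their name (hypothetical convention).
    for name, param in model.named_parameters():
        param.requires_grad = "router" in name

def router_tuning_step(model, batch, optimizer):
    # Only router parameters receive gradients, so the update is cheap.
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```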
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
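The summary names a layerwise recurrent router without spelling out the mechanism; one plausible reading is a small recurrent cell whose hidden state is carried from layer to layer so routing decisions are shared across depth. The GRU cell, dimensions, and top-1 selection below are assumptions, not RMoE's exact design.

```python
# Sketch of a layerwise recurrent router: a small GRU state is passed across layers
# so each layer's routing decision is conditioned on earlier layers' routing.
import torch
import torch.nn as nn

class RecurrentRouter(nn.Module):
    def __init__(self, d_model: int, d_route: int, num_experts: int):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_route)
        self.to_logits = nn.Linear(d_route, num_experts)

    def forward(self, token_repr, prev_state):
        # token_repr: (num_tokens, d_model); prev_state: (num_tokens, d_route)
        state = self.cell(token_repr, prev_state)      # carry routing info across layers
        expert = self.to_logits(state).argmax(dim=-1)  # top-1 expert per token
        return expert, state
```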
- MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
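The title points to a routing mask; as a rough illustration only, the sketch below restricts each vocabulary item to a fixed subset of experts by masking the router logits. How MaskMoE actually constructs its mask is not stated in this summary, and the random mask here is purely a placeholder.

```python
# Sketch: per-token-id routing mask applied to router logits (illustrative only).
import torch

def build_token_mask(vocab_size, num_experts, visible_per_token, generator=None):
    scores = torch.rand(vocab_size, num_experts, generator=generator)
    keep = scores.topk(visible_per_token, dim=-1).indices
    mask = torch.full((vocab_size, num_experts), float("-inf"))
    mask.scatter_(1, keep, 0.0)
    return mask                                   # 0 where allowed, -inf where blocked

def masked_route(router_logits, token_ids, mask):
    # router_logits: (num_tokens, E); token_ids: (num_tokens,)
    return (router_logits + mask[token_ids]).argmax(dim=-1)
```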
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
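A minimal sketch of locality-biased routing, under the assumption that each expert's hosting node is known and that locality is encouraged with a simple additive bonus to the router scores; the bonus value and the node mapping are illustrative, not LocMoE's exact mechanism.

```python
# Sketch: experts on the local node get a score bonus, so part of the all-to-all
# traffic stays intra-node instead of crossing nodes.
import torch

def locality_biased_route(router_logits, expert_node, local_node, bonus=1.0):
    # router_logits: (num_tokens, E); expert_node: (E,) node id hosting each expert
    local = (expert_node == local_node).float()    # 1.0 for intra-node experts
    return (router_logits + bonus * local).argmax(dim=-1)
```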
- Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applied to Switch Transformer on both pre-training and machine translation tasks, SparseMixer delivers considerable performance gains.
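The mid-point method mentioned above is a standard second-order ODE solver; the toy example below shows only this numerical ingredient, not the SparseMixer gradient estimator itself.

```python
# Mid-point rule for a generic ODE dy/dt = f(t, y): evaluate the slope at the
# interval's mid-point for second-order accuracy.
def midpoint_step(f, t, y, h):
    k = f(t, y)                                   # slope at the start of the interval
    return y + h * f(t + h / 2, y + h / 2 * k)    # re-evaluate at the mid-point

# Example: dy/dt = y, exact solution e^t.
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = midpoint_step(lambda t, y: y, t, y, h)
    t += h
# y is roughly 2.714, close to e = 2.718 after 10 steps.
```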
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
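A rough sketch of patch-level dispatch: the input is split into patches and each patch updates only the expert it is routed to, which is the intuition behind the sample-efficiency result. The patch representation, expert design, and top-1 gating below are assumptions for illustration, not the paper's exact pMoE architecture.

```python
# Sketch: route each patch to one small expert network.
import torch
import torch.nn as nn

class PatchMoE(nn.Module):
    def __init__(self, patch_dim: int, num_experts: int, hidden: int = 64):
        super().__init__()
        self.gate = nn.Linear(patch_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU(), nn.Linear(hidden, patch_dim))
            for _ in range(num_experts)
        )

    def forward(self, patches):                    # patches: (num_patches, patch_dim)
        choice = self.gate(patches).argmax(dim=-1)
        out = torch.zeros_like(patches)
        for e, expert in enumerate(self.experts):
            idx = (choice == e).nonzero(as_tuple=True)[0]
            if idx.numel():
                out[idx] = expert(patches[idx])    # each patch updates only one expert
        return out
```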
- SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
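A minimal sketch of bi-level routing: one gate chooses a node (a group of co-located experts) and a second gate chooses an expert within that node, so cross-node and intra-node dispatch are decided separately. The gate shapes and top-1 selection are assumptions, not SMILE's exact design.

```python
# Sketch: two-level routing, first to a node, then to an expert inside that node.
import torch
import torch.nn as nn

class BiLevelRouter(nn.Module):
    def __init__(self, d_model, num_nodes, experts_per_node):
        super().__init__()
        self.node_gate = nn.Linear(d_model, num_nodes)
        self.expert_gate = nn.Linear(d_model, experts_per_node)
        self.experts_per_node = experts_per_node

    def forward(self, x):                              # x: (num_tokens, d_model)
        node = self.node_gate(x).argmax(dim=-1)        # level 1: which node
        local = self.expert_gate(x).argmax(dim=-1)     # level 2: which expert there
        return node * self.experts_per_node + local    # global expert index
```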
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to an outrageously large number of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts)
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
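A minimal sketch of stochastic expert activation as described above: no router is learned, and an expert is drawn at random per input.

```python
# Sketch: random expert activation in place of a learned router.
import random
import torch.nn as nn

class StochasticExperts(nn.Module):
    def __init__(self, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)

    def forward(self, x):
        expert = random.choice(self.experts)   # no router, so nothing to fluctuate
        return expert(x)
```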
- Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
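A minimal sketch of hash-based routing: the expert is a fixed function of the token id, so no routing is learned and assignments cannot fluctuate. The specific hash below is an arbitrary choice; the paper evaluates several hashing schemes.

```python
# Sketch: route a token to an expert via a fixed hash of its id.
import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    digest = hashlib.md5(str(token_id).encode()).digest()
    return int.from_bytes(digest[:4], "little") % num_experts

# Every occurrence of a token id always lands on the same expert:
assert hash_route(42, 16) == hash_route(42, 16)
```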
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.