SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
- URL: http://arxiv.org/abs/2212.05191v1
- Date: Sat, 10 Dec 2022 03:44:16 GMT
- Title: SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
- Authors: Chaoyang He, Shuai Zheng, Aston Zhang, George Karypis, Trishul
Chilimbi, Mahdi Soltanolkotabi, Salman Avestimehr
- Abstract summary: We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
- Score: 47.11171833082974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) parallelism is a recent advancement that scales
up the model size with constant computational cost. MoE selects different sets
of parameters (i.e., experts) for each incoming token, resulting in a
sparsely-activated model. Despite several successful applications of MoE, its
training efficiency degrades significantly as the number of experts increases.
The routing stage in MoE relies on the efficiency of the All2All communication
collective, which suffers from network congestion and has poor scalability. To
mitigate these issues, we introduce SMILE, which exploits heterogeneous network
bandwidth and splits a single-step routing into bi-level routing. Our
experimental results show that the proposed method obtains a 2.5x speedup over
Switch Transformer in terms of pretraining throughput on the Colossal Clean
Crawled Corpus without losing any convergence speed.
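As a rough illustration of the idea in the abstract, the sketch below contrasts single-step top-1 routing over all experts with a two-step (bi-level) choice. It is not the authors' implementation. The function names, the one-expert-per-GPU layout, and the mapping of the first level to an inter-node hop and the second level to an intra-node hop are assumptions made for this example.

```python
import random


def argmax(xs):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def single_level_route(logits):
    """Top-1 routing over every expert at once (one global exchange)."""
    return argmax(logits)


def bi_level_route(node_logits, local_logits, gpus_per_node):
    """Pick a node first, then an expert (GPU) inside that node.

    Assumption for this sketch: one expert per GPU, numbered node by node.
    """
    node = argmax(node_logits)    # level 1: assumed inter-node hop
    local = argmax(local_logits)  # level 2: assumed intra-node hop
    return node * gpus_per_node + local


def peers_per_exchange(num_nodes, gpus_per_node):
    """Number of ranks involved in each exchange under the two schemes."""
    return {
        "single_level": num_nodes * gpus_per_node,
        "bi_level": (num_nodes, gpus_per_node),
    }


if __name__ == "__main__":
    random.seed(0)
    num_nodes, gpus_per_node = 4, 8  # hypothetical cluster shape
    node_logits = [random.gauss(0, 1) for _ in range(num_nodes)]
    local_logits = [random.gauss(0, 1) for _ in range(gpus_per_node)]
    print("expert:", bi_level_route(node_logits, local_logits, gpus_per_node))
    print(peers_per_exchange(num_nodes, gpus_per_node))
```

With the hypothetical 4-node, 8-GPU layout, a single-step exchange spans all 32 ranks, while the two-step variant uses one exchange across 4 nodes and one across the 8 GPUs inside a node. That is the kind of split that could take advantage of faster intra-node links.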
Related papers
- Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [40.40923861822689]
Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two main challenges.
We propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training.
For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency.
arXiv Detail & Related papers (2024-10-17T03:23:50Z)
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can be readily removed through a caching mechanism, even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
- Tutel: Adaptive Mixture-of-Experts at Scale [20.036168971435306]
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost.
We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining.
Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
arXiv Detail & Related papers (2022-06-07T15:20:20Z)
- Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [78.77361169167149]
We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
arXiv Detail & Related papers (2022-05-28T05:12:43Z)
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z)
- Low-Latency Federated Learning over Wireless Channels with Differential Privacy [142.5983499872664]
In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server.
In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement.
arXiv Detail & Related papers (2021-06-20T13:51:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.