SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
- URL: http://arxiv.org/abs/2212.05191v1
- Date: Sat, 10 Dec 2022 03:44:16 GMT
- Title: SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
- Authors: Chaoyang He, Shuai Zheng, Aston Zhang, George Karypis, Trishul
Chilimbi, Mahdi Soltanolkotabi, Salman Avestimehr
- Abstract summary: We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
- Score: 47.11171833082974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) parallelism is a recent advancement that scales
up the model size with constant computational cost. MoE selects different sets
of parameters (i.e., experts) for each incoming token, resulting in a
sparsely-activated model. Despite several successful applications of MoE, its
training efficiency degrades significantly as the number of experts increases.
The routing stage in MoE relies on the efficiency of the All2All communication
collective, which suffers from network congestion and has poor scalability. To
mitigate these issues, we introduce SMILE, which exploits heterogeneous network
bandwidth and splits a single-step routing into bi-level routing. Our
experimental results show that the proposed method obtains a 2.5x speedup over
Switch Transformer in terms of pretraining throughput on the Colossal Clean
Crawled Corpus without losing any convergence speed.
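The abstract does not spell out the routing mechanics, but a minimal sketch of bi-level (node-first, then expert-within-node) top-1 routing might look as follows. The class and parameter names are illustrative assumptions, not SMILE's implementation; the intent is that the coarse node choice governs the only inter-node transfer, while the fine-grained expert choice is resolved over the faster intra-node links.

```python
# Minimal sketch of bi-level (node-level, then expert-level) top-1 routing.
# All names (BiLevelRouter, num_nodes, experts_per_node) are illustrative,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRouter(nn.Module):
    def __init__(self, d_model: int, num_nodes: int, experts_per_node: int):
        super().__init__()
        # Level 1: choose a node (coarse routing over the slow inter-node links).
        self.node_gate = nn.Linear(d_model, num_nodes)
        # Level 2: choose an expert within the selected node (fast intra-node links).
        self.expert_gate = nn.Linear(d_model, experts_per_node)
        self.experts_per_node = experts_per_node

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        node_probs = F.softmax(self.node_gate(x), dim=-1)
        node_id = node_probs.argmax(dim=-1)                   # [num_tokens]
        local_probs = F.softmax(self.expert_gate(x), dim=-1)
        local_id = local_probs.argmax(dim=-1)                 # [num_tokens]
        # Global expert index; the combine weight mirrors Switch-style top-1 gating.
        expert_id = node_id * self.experts_per_node + local_id
        gate = node_probs.gather(-1, node_id[:, None]) * \
               local_probs.gather(-1, local_id[:, None])
        return expert_id, gate.squeeze(-1)

tokens = torch.randn(8, 16)
router = BiLevelRouter(d_model=16, num_nodes=4, experts_per_node=2)
expert_id, gate = router(tokens)
```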
Related papers
- M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference [8.792650582656913]
We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment.
M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup.
In self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench.
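The summary gives few architectural details; one generic way to read "modulating residual velocity" is to scale each residual update by a learned per-token rate, as in the sketch below. This is our interpretation for illustration only, not M2R2's actual design.

```python
# Generic illustration of modulating "residual velocity": scale each residual
# update by a learned, per-token rate. This is our reading of the summary,
# not M2R2's architecture.
import torch
import torch.nn as nn

class RateModulatedResidual(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.block = nn.Linear(d_model, d_model)   # stand-in for a transformer sub-block
        self.rate = nn.Linear(d_model, 1)          # predicts the residual rate per token

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.rate(h))        # rate in (0, 1)
        return h + alpha * self.block(h)           # faster or slower residual updates

h = torch.randn(4, 32)
print(RateModulatedResidual(32)(h).shape)
```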
arXiv Detail & Related papers (2025-02-04T06:13:52Z)
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.736973273162675]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget.
We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing.
ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
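A minimal sketch of ReLU routing as a drop-in for TopK+Softmax: an expert is active whenever its ReLU gate is nonzero, so the router stays differentiable, and sparsity can be encouraged with an L1-style penalty on the gates. The penalty shown here is our simplification of the sparsity control.

```python
# ReLU router sketch: nonzero gates mark active experts; an L1 penalty on the
# gates (our simplification) controls how sparse the routing is.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor):
        g = F.relu(self.gate(x))           # [tokens, experts], mostly zeros when sparse
        active = g > 0                     # which experts each token uses
        l1_penalty = g.sum(dim=-1).mean()  # add to the loss to control sparsity
        return g, active, l1_penalty

x = torch.randn(8, 16)
gates, active, penalty = ReLURouter(16, 4)(x)
```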
arXiv Detail & Related papers (2024-12-19T10:21:20Z)
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
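A sketch of a layerwise recurrent router: a small recurrent cell carries routing state across layers, so each layer's routing decision can condition on earlier layers. The GRU-style cell and top-1 selection here are our assumptions and may differ from the paper's exact design.

```python
# Layerwise recurrent router sketch: one recurrent cell is shared by all MoE
# layers and its hidden state threads routing information between them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentRouter(nn.Module):
    def __init__(self, d_model: int, d_router: int, num_experts: int):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_router)    # shared across all layers
        self.proj = nn.Linear(d_router, num_experts)

    def forward(self, x: torch.Tensor, state: torch.Tensor):
        # x: [tokens, d_model]; state: routing state carried from the previous layer
        state = self.cell(x, state)
        probs = F.softmax(self.proj(state), dim=-1)
        return probs.argmax(dim=-1), state

router = RecurrentRouter(d_model=16, d_router=8, num_experts=4)
x, state = torch.randn(8, 16), torch.zeros(8, 8)
for _ in range(3):                                   # three MoE layers sharing one router
    expert_id, state = router(x, state)
```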
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
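As an illustration of layer caching in a diffusion transformer: at one denoising step, layers flagged as "cacheable" reuse their output from the previous step instead of recomputing. The learnable per-layer score and threshold below are our simplification of L2C, not its actual objective.

```python
# Layer-caching sketch: a learned per-layer score decides whether a layer
# recomputes or reuses its cached output from the previous denoising step.
import torch
import torch.nn as nn

class CachedLayers(nn.Module):
    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
        self.cache_logit = nn.Parameter(torch.zeros(num_layers))  # learned cache decisions
        self.cache = [None] * num_layers

    def forward(self, h: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            reuse = (self.cache[i] is not None
                     and torch.sigmoid(self.cache_logit[i]).item() > threshold)
            if reuse:
                out = self.cache[i]          # skip computation, reuse previous step's output
            else:
                out = layer(h)
                self.cache[i] = out
            h = h + out                      # residual connection
        return h

model = CachedLayers(num_layers=4, d_model=32)
h = torch.randn(2, 32)
for t in range(3):                           # successive denoising steps share the cache
    h = model(h)
```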
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
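One simple way to picture locality-aware routing is to bias the router logits toward experts hosted on the token's own node, so traffic shifts from inter-node to intra-node links. The fixed bias term below is our simplification, not LocMoE's exact rule.

```python
# Locality-biased top-1 routing sketch: experts on the local node get a bonus
# added to their logits before the argmax selection.
import torch

def locality_biased_route(logits: torch.Tensor, expert_node: torch.Tensor,
                          local_node: int, bias: float = 1.0) -> torch.Tensor:
    # logits: [tokens, num_experts]; expert_node[e] = node hosting expert e
    bonus = (expert_node == local_node).float() * bias
    return (logits + bonus).argmax(dim=-1)        # [tokens] chosen expert ids

logits = torch.randn(8, 6)
expert_node = torch.tensor([0, 0, 0, 1, 1, 1])    # experts 0-2 on node 0, 3-5 on node 1
print(locality_biased_route(logits, expert_node, local_node=0))
```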
arXiv Detail & Related papers (2024-01-25T03:36:39Z)
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
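The theoretical result concerns patch-level routing, where each expert processes only a subset of an input's patches. A minimal sketch of that routing step, with the per-expert top-l patch selection as our reading of the setting, is shown below.

```python
# Patch-level routing sketch: each expert scores all patches of a sample and
# keeps only its top-l patches, so no expert sees the whole input.
import torch
import torch.nn as nn

class PatchRouter(nn.Module):
    def __init__(self, d_patch: int, num_experts: int, l: int):
        super().__init__()
        self.gate = nn.Linear(d_patch, num_experts)
        self.l = l

    def forward(self, patches: torch.Tensor):
        # patches: [num_patches, d_patch] for one sample
        scores = self.gate(patches)                    # [num_patches, num_experts]
        top_l = scores.topk(self.l, dim=0).indices     # [l, num_experts] patch ids per expert
        return top_l

patches = torch.randn(16, 32)                          # 16 patches per image
print(PatchRouter(d_patch=32, num_experts=4, l=4)(patches).shape)
```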
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
- Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [78.77361169167149]
We propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines.
Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance.
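A minimal sketch of the token-level mechanism: during training, each token ignores the gating decision with probability p and is reassigned to an expert on its local machine, skipping the cross-machine All2All. The local-expert fallback below is our simplification.

```python
# Gating Dropout sketch: with probability p, a token's routed expert id is
# overwritten by a designated local expert, so the token never leaves its machine.
import torch

def gating_dropout(routed_expert: torch.Tensor, local_expert: int,
                   p: float, training: bool) -> torch.Tensor:
    # routed_expert: [tokens] expert ids chosen by the gating network
    if not training or p == 0.0:
        return routed_expert
    drop = torch.rand_like(routed_expert, dtype=torch.float) < p
    return torch.where(drop, torch.full_like(routed_expert, local_expert), routed_expert)

routed = torch.randint(0, 8, (10,))
print(gating_dropout(routed, local_expert=2, p=0.3, training=True))
```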
arXiv Detail & Related papers (2022-05-28T05:12:43Z)
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
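As a rough illustration of the two-stage idea (omitting the paper's router distillation step), the second stage freezes the gating network so token-to-expert assignments stop fluctuating.

```python
# Two-stage routing schedule sketch: stage 1 trains the router with the model,
# stage 2 freezes it for stable expert assignments. Distillation is omitted.
import torch.nn as nn

def set_stage(router: nn.Module, stage: int) -> None:
    # Stage 1: router is trainable. Stage 2: router parameters are frozen.
    for param in router.parameters():
        param.requires_grad = (stage == 1)

router = nn.Linear(16, 8)     # stand-in gating network: d_model -> num_experts
set_stage(router, stage=1)    # learn the routing strategy
# ... train for a while ...
set_stage(router, stage=2)    # freeze routing for stable expert assignment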
arXiv Detail & Related papers (2022-04-18T16:48:19Z)
- Low-Latency Federated Learning over Wireless Channels with Differential Privacy [142.5983499872664]
In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server.
In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement.
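The paper itself focuses on delay minimization over wireless channels; as background only, the sketch below shows the standard client-side differential-privacy step (clip the local update, add Gaussian noise), not the paper's scheme.

```python
# Standard Gaussian-mechanism DP step for a federated client update:
# bound the update norm, then add calibrated noise before sending to the server.
import torch

def dp_local_update(update: torch.Tensor, clip_norm: float, noise_std: float) -> torch.Tensor:
    norm = update.norm()
    clipped = update * min(1.0, clip_norm / (norm.item() + 1e-12))  # clip sensitivity
    return clipped + noise_std * torch.randn_like(clipped)          # add Gaussian noise

update = torch.randn(100)
print(dp_local_update(update, clip_norm=1.0, noise_std=0.1).norm())
```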
arXiv Detail & Related papers (2021-06-20T13:51:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content listed here (including all information) and is not responsible for any consequences.