Tutel: Adaptive Mixture-of-Experts at Scale
- URL: http://arxiv.org/abs/2206.03382v2
- Date: Mon, 5 Jun 2023 15:05:24 GMT
- Title: Tutel: Adaptive Mixture-of-Experts at Scale
- Authors: Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu,
Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng,
Fan Yang, Mao Yang, Yongqiang Xiong
- Abstract summary: Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost.
We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining.
Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
- Score: 20.036168971435306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep
learning models to trillion-plus parameters with fixed computational cost. The
algorithmic performance of MoE relies on its token routing mechanism that
forwards each input token to the right sub-models or experts. While token
routing dynamically determines the amount of expert workload at runtime,
existing systems suffer from inefficient computation due to their static execution,
namely static parallelism and pipelining, which does not adapt to the dynamic
workload. We present Flex, a highly scalable stack design and implementation
for MoE with dynamically adaptive parallelism and pipelining. Flex adopts an
identical layout for distributing MoE model parameters and input data, which
can be leveraged by all possible parallelism or pipelining methods without any
mathematical inequivalence or tensor migration overhead. This enables adaptive
parallelism/pipelining optimization at zero cost during runtime. Based on this
key design, Flex also implements various MoE acceleration techniques.
Aggregating all techniques, Flex delivers significant speedup at any scale:
4.96x and 5.75x speedup of a single MoE layer on 16 and 2,048 A100 GPUs,
respectively, over the previous state of the art. Our evaluation shows that
Flex efficiently and effectively runs a real-world MoE-based model named
SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision
architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x
and 2.11x speedup in training and inference over Fairseq, respectively. On
effectiveness, the SwinV2-MoE model achieves superior accuracy to its dense
counterpart in both pre-training and downstream computer vision tasks such as
COCO object detection, indicating the readiness of Flex for end-to-end
real-world model training and inference.
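To make the token-routing mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of sparsely-gated top-k routing. It is a generic illustration rather than Tutel/Flex's actual implementation; the class name, layer sizes, expert MLP shape, and the k=2 default are illustrative assumptions.

```python
# Minimal sketch of sparsely-gated top-k token routing (generic illustration,
# not Tutel/Flex code). A gate scores every token, each token is dispatched to
# its top-k experts, and the expert outputs are combined with the gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)             # routing probabilities
        topk_w, topk_idx = probs.topk(self.k, dim=-1)       # per-token expert choices
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                sel = topk_idx[:, slot] == e                # tokens routed to expert e
                if sel.any():                               # workload is data-dependent
                    out[sel] += topk_w[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out


tokens = torch.randn(1024, 512)                             # a batch of token embeddings
moe_out = TopKMoELayer()(tokens)                            # [1024, 512]
```

Because the selection mask depends on the gate's decisions for the current batch, each expert's workload is only known at runtime; this is the dynamic workload that, per the abstract, static parallelism and pipelining fail to adapt to and that Flex's identical-layout design lets it re-optimize for at zero cost.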
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule [50.260693393896716]
Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images.
Recent techniques have been employed to automatically search for faster generation processes.
We introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models.
arXiv Detail & Related papers (2024-09-26T06:28:05Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient because it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z) - Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous inner-node and inter-node communication overhead.
We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them.
PPMoE builds expert parallelism on top of tensor parallelism and replaces communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z) - FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via
Dynamic Device Placement [19.639936387834677]
Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scalability in various downstream tasks.
MoEs are becoming a new data analytics paradigm in the data life cycle and suffer from unique challenges at scales, complexities, and granularities never before possible.
In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow.
arXiv Detail & Related papers (2023-04-08T07:34:26Z) - SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)