SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training
and Inference System
- URL: http://arxiv.org/abs/2205.10034v2
- Date: Mon, 12 Jun 2023 12:07:22 GMT
- Title: SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training
and Inference System
- Authors: Liang Shen, Zhihua Wu, WeiBao Gong, Hongxiang Hao, Yangfan Bai,
HuaChao Wu, Xinxuan Wu, Jiang Bian, Haoyi Xiong, Dianhai Yu, Yanjun Ma
- Abstract summary: Mixture-of-Experts (MoE) models have been proposed to lower the cost of training for a given overall model/data size.
We present SE-MoE, which introduces Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference on a single node, especially when the model is larger than GPU memory, SE-MoE organizes CPU and GPU memory jointly into a ring of sections that holds the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
- Score: 24.335267149209848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing diversity of ML infrastructures, distributed
training over heterogeneous computing systems is desirable for producing big
models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of
training for a given overall model/data size through gating and parallelism in
a divide-and-conquer fashion. Although DeepSpeed has made efforts to carry out
large-scale MoE training over heterogeneous infrastructures, training and
inference efficiency can still be improved in several system aspects, including
load balancing, communication/computation efficiency, and memory footprint
limits. In this work, we present SE-MoE, which introduces Elastic MoE training
with 2D prefetch and Fusion communication over Hierarchical storage to enable
efficient parallelism of various types. For scalable inference on a single
node, especially when the model is larger than GPU memory, SE-MoE organizes
CPU and GPU memory jointly into a ring of sections that holds the model, and
executes the computation tasks across the memory sections in a round-robin
manner for efficient inference. We carried out extensive experiments to
evaluate SE-MoE, which successfully trains a Unified Feature Optimization (UFO)
model with a 12B-parameter Sparsely-Gated Mixture-of-Experts model in 8 days
on 48 A100 GPU cards. Comparison against the state of the art shows that
SE-MoE outperforms DeepSpeed with 33% higher throughput (tokens per second) in
training and 13% higher throughput in inference overall. In particular, under
unbalanced MoE tasks such as UFO, SE-MoE achieves 64% higher throughput with an
18% lower memory footprint. The code of the framework will be released at
https://github.com/PaddlePaddle/Paddle.
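To make the single-node inference scheme described above concrete, here is a minimal sketch of rotating model sections through a fixed number of GPU-resident slots in round-robin order. The `Section` class, the `load_to_gpu`/`offload_to_cpu` helpers, and the `gpu_slots` parameter are hypothetical illustrations, not SE-MoE's actual API.

```python
# Minimal sketch (hypothetical names, not SE-MoE's API): model sections rotate
# through a fixed number of GPU-resident slots and are executed round-robin.
from collections import deque

class Section:
    """A contiguous slice of the model that can live in CPU or GPU memory."""
    def __init__(self, idx, layers):
        self.idx = idx
        self.layers = layers              # stand-in for the real weights
        self.on_gpu = False

    def load_to_gpu(self):                # hypothetical host-to-device copy
        self.on_gpu = True

    def offload_to_cpu(self):             # hypothetical device-to-host copy / free
        self.on_gpu = False

    def forward(self, x):
        assert self.on_gpu, "section must be resident on GPU before compute"
        return x + sum(self.layers)       # stand-in for the real layer computation

def ring_inference(sections, x, gpu_slots=2):
    """Rotate sections through `gpu_slots` GPU-resident slots, round-robin."""
    ring = deque(sections)
    resident = deque()
    for _ in range(min(gpu_slots, len(ring))):   # prime the first slots
        s = ring.popleft()
        s.load_to_gpu()
        resident.append(s)
    while resident:
        s = resident.popleft()
        x = s.forward(x)                  # compute on the GPU-resident section
        s.offload_to_cpu()                # release its slot
        if ring:                          # pull the next section into the freed slot
            nxt = ring.popleft()
            nxt.load_to_gpu()
            resident.append(nxt)
    return x

if __name__ == "__main__":
    model = [Section(i, layers=[0.1 * i] * 4) for i in range(8)]
    print(ring_inference(model, x=1.0, gpu_slots=2))
```

In the real system the load of the next section would overlap with computation on the current one; the sketch only shows the scheduling order.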
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13× to 5.77× speedup on 1296 manually configured MoE layers and approximately 3× improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
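As a rough, hypothetical sketch of the dense-training/sparse-inference pattern (not the DS-MoE implementation), the MoE layer below activates every expert in training mode and only the top-k gated experts in eval mode; all dimensions and module names are illustrative.

```python
# Rough sketch (illustrative, not the DS-MoE code): every expert is active in
# training mode; only the top-k gated experts run at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)       # [tokens, n_experts]
        if self.training:
            # Dense training: weighted sum over all experts.
            outs = torch.stack([e(x) for e in self.experts], dim=1)
            return (scores.unsqueeze(-1) * outs).sum(dim=1)
        # Sparse inference: keep only the top-k experts per token.
        topv, topi = scores.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e_idx, expert in enumerate(self.experts):
                mask = topi[:, slot] == e_idx
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = DenseTrainSparseInferMoE()
tokens = torch.randn(16, 64)
moe.train(); dense_out = moe(tokens)                   # all 8 experts used
moe.eval()
with torch.no_grad():
    sparse_out = moe(tokens)                           # 2 experts per token
```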
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5× more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train the individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Efficient Parallelization Layouts for Large-Scale Distributed Model Training [17.16249954009967]
We conduct a comprehensive study of possible training configurations for large language models.
We find that using a micro-batch size of 1 usually enables the most efficient training layouts.
Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes.
arXiv Detail & Related papers (2023-11-09T18:59:38Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model.
Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models.
We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
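One way to picture a three-dimensional data x tensor x expert decomposition is to place ranks on a 3-D grid and group them along each axis. The sketch below is a generic illustration under that assumption, not DeepSpeed-TED's actual process-group construction (which would use a distributed framework's communicator API).

```python
# Generic sketch of a 3-D rank layout for data x expert x tensor parallelism
# (illustrative only; not DeepSpeed-TED's actual process-group setup).
def build_groups(world_size, tensor_par, expert_par):
    assert world_size % (tensor_par * expert_par) == 0
    data_par = world_size // (tensor_par * expert_par)

    # Rank r sits at coordinates (d, e, t) on a data x expert x tensor grid.
    def coords(r):
        return (r // (expert_par * tensor_par),
                (r // tensor_par) % expert_par,
                r % tensor_par)

    groups = {"data": {}, "expert": {}, "tensor": {}}
    for r in range(world_size):
        d, e, t = coords(r)
        # Ranks sharing (e, t) form a data-parallel group, and so on.
        groups["data"].setdefault((e, t), []).append(r)
        groups["expert"].setdefault((d, t), []).append(r)
        groups["tensor"].setdefault((d, e), []).append(r)
    return data_par, groups

if __name__ == "__main__":
    dp, g = build_groups(world_size=16, tensor_par=2, expert_par=4)
    print("data-parallel degree:", dp)                        # 2
    print("one tensor-parallel group:", g["tensor"][(0, 0)])  # [0, 1]
    print("one expert-parallel group:", g["expert"][(0, 0)])  # [0, 2, 4, 6]
```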
arXiv Detail & Related papers (2023-03-11T05:38:15Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models that would otherwise require a high memory footprint.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)