MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
- URL: http://arxiv.org/abs/2411.00662v1
- Date: Fri, 01 Nov 2024 15:27:20 GMT
- Title: MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
- Authors: Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li,
- Abstract summary: Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models into a single supermodel.
This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume.
- Score: 21.993498492979672
- Abstract: The Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models from various domains into a single supermodel. This approach enables the model to scale without significantly increasing the computational costs of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and on the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model on 16 A800 cards with an 8K sequence length improves overall latency by 13%. Project Page: https://github.com/EnflameTechnology/DeepSpeed.
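As a rough illustration of selecting a parallel strategy from communication volume and network topology, the sketch below compares two hypothetical ways of moving MoE activations under tensor parallelism: sending the full (duplicated) activations through an inter-node AllToAll, or sending a 1/T shard through AllToAll and rebuilding the rest with an intra-node AllGather. The strategy names, the bandwidth-only cost model, and the bandwidth figures are assumptions made for this example; they are not taken from the paper or from the DeepSpeed API.

```python
# Minimal sketch of a traffic-aware strategy choice (illustrative only).
# Assumptions: time = bytes / bandwidth (latency and overlap are ignored),
# AllToAll crosses the inter-node link, AllGather stays inside a node.

STRATEGIES = ("duplicate", "shard_then_gather")  # hypothetical names


def estimate_seconds(strategy, volume_bytes, tp_degree, intra_bw, inter_bw):
    """Estimate communication time for one MoE dispatch.

    volume_bytes: activation bytes each rank holds before dispatch
    tp_degree:    tensor-parallel group size (activations are replicated
                  across this group)
    intra_bw:     intra-node bandwidth in bytes/second
    inter_bw:     inter-node bandwidth in bytes/second
    """
    if strategy == "duplicate":
        # Every tensor-parallel rank sends its full replica over AllToAll.
        return volume_bytes / inter_bw
    if strategy == "shard_then_gather":
        # Each rank sends only its 1/tp_degree shard over AllToAll ...
        all_to_all = (volume_bytes / tp_degree) / inter_bw
        # ... then gathers the remaining shards from its node-local peers.
        all_gather = (volume_bytes * (tp_degree - 1) / tp_degree) / intra_bw
        return all_to_all + all_gather
    raise ValueError(f"unknown strategy: {strategy}")


def choose_strategy(volume_bytes, tp_degree, intra_bw, inter_bw):
    """Return (best_strategy, costs) under the simple cost model above."""
    costs = {
        name: estimate_seconds(name, volume_bytes, tp_degree, intra_bw, inter_bw)
        for name in STRATEGIES
    }
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    # 8K tokens x 8192 hidden size x 2 bytes (fp16); all values are made up.
    volume = 8192 * 8192 * 2
    best, costs = choose_strategy(volume, tp_degree=8,
                                  intra_bw=200e9, inter_bw=25e9)
    for name, seconds in costs.items():
        print(f"{name}: {seconds * 1e3:.3f} ms")
    print("selected:", best)
```

Under this model the sharded variant wins when the intra-node link is much faster than the inter-node link, and loses when it is not, which is why the choice depends on both the traffic volume and the topology. A real system would also need to account for latency, overlap with computation, and routing imbalance.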
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13$\times$ to 5.77$\times$ speedup on 1296 manually configured MoE layers and approximately 3$\times$ improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model.
Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models.
We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
arXiv Detail & Related papers (2023-03-11T05:38:15Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL): cross-mesh resharding.
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model that performs better than one obtained by training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.