SuperScaler: Supporting Flexible DNN Parallelization via a Unified
Abstraction
- URL: http://arxiv.org/abs/2301.08984v1
- Date: Sat, 21 Jan 2023 17:47:55 GMT
- Title: SuperScaler: Supporting Flexible DNN Parallelization via a Unified
Abstraction
- Authors: Zhiqi Lin, Youshan Miao, Guodong Liu, Xiaoxiang Shi, Quanlu Zhang, Fan
Yang, Saeed Maleki, Yi Zhu, Xu Cao, Cheng Li, Mao Yang, Lintao Zhang, Lidong
Zhou
- Abstract summary: SuperScaler is a system that facilitates the design and generation of flexible parallelization plans.
It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data dependency preserving.
As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup.
- Score: 17.82865339337427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing model size, deep neural networks (DNN) are increasingly
trained over massive GPU accelerators, which demands a proper parallelization
plan that transforms a DNN model into fine-grained tasks and then schedules
them to GPUs for execution. Due to the large search space, the contemporary
parallelization plan generators often rely on empirical rules that couple
transformation and scheduling, and fall short in exploring more flexible
schedules that yield better memory usage and compute efficiency. This tension
can be exacerbated by the emerging models with increasing complexity in their
structure and model size. SuperScaler is a system that facilitates the design
and generation of highly flexible parallelization plans. It explicitly formulates
plan design and generation as three sequential phases: model transformation,
space-time scheduling, and data dependency preserving. Such a
principled approach decouples multiple seemingly intertwined factors and
enables the composition of highly flexible parallelization plans. As a result,
SuperScaler can not only generate empirical parallelization plans, but also
construct new plans that achieve up to 3.5X speedup compared to
state-of-the-art solutions like DeepSpeed, Megatron and Alpa, for emerging DNN
models like Swin-Transformer and AlphaFold2, as well as well-optimized models
like GPT-3.
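The abstract names the three phases but not their concrete interfaces, so the following is only a minimal, hypothetical Python sketch of the idea: every class and function name below is made up for illustration and is not SuperScaler's actual API. It shows how operator splitting (model transformation), device/step assignment (space-time scheduling), and communication insertion (data dependency preserving) can be composed as independent steps rather than coupled empirical rules.

```python
# Hypothetical toy sketch (not SuperScaler's API): plan generation decomposed
# into the three phases named in the abstract.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # names of producer tasks
    device: int = -1                          # filled in by scheduling
    step: int = -1

def transform(ops, degree):
    """Phase 1: model transformation -- split every operator into `degree`
    fine-grained tasks (here same-index shards depend on each other; a real
    transform would also emit cross-shard reduction edges when needed)."""
    tasks = {}
    for op, dep_ops in ops:
        for i in range(degree):
            tasks[f"{op}/{i}"] = Task(f"{op}/{i}", [f"{d}/{i}" for d in dep_ops])
    return tasks

def schedule(tasks, num_devices):
    """Phase 2: space-time scheduling -- decide where (device) and when (step)
    each task runs, independently of how the model was transformed.
    Here: a naive contiguous placement; a real planner searches this space."""
    for step, task in enumerate(tasks.values()):
        task.device = (step * num_devices) // len(tasks)
        task.step = step
    return tasks

def preserve_dependencies(tasks):
    """Phase 3: data-dependency preserving -- add a communication op for every
    producer/consumer edge that ends up crossing a device boundary."""
    comms = []
    for task in tasks.values():
        for dep in task.deps:
            if tasks[dep].device != task.device:
                comms.append(f"send {dep}: dev{tasks[dep].device} -> dev{task.device}")
    return comms

# A two-operator chain, split 2-way and placed on 2 devices.
ops = [("matmul", []), ("gelu", ["matmul"])]
plan = schedule(transform(ops, degree=2), num_devices=2)
print(preserve_dependencies(plan))
# ['send matmul/0: dev0 -> dev1', 'send matmul/1: dev0 -> dev1']
```

Because each phase consumes only the output of the previous one, swapping in a different splitting rule or schedule does not require touching the dependency-preservation step, which is the kind of flexibility the abstract emphasizes.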
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Learning Versatile 3D Shape Generation with Improved AR Models [91.87115744375052]
Auto-regressive (AR) models have achieved impressive results in 2D image generation by modeling joint distributions in the grid space.
We propose the Improved Auto-regressive Model (ImAM) for 3D shape generation, which applies discrete representation learning based on a latent vector instead of volumetric grids.
arXiv Detail & Related papers (2023-03-26T12:03:18Z)
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism [25.928940638269534]
We propose Galvatron, a framework that automatically finds the most efficient hybrid parallelism strategy.
Galvatron consistently achieves superior system throughput compared to prior work that supports only limited parallelism.
arXiv Detail & Related papers (2022-11-25T03:45:31Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
- SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism.
Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers (a toy sketch of this kind of hybrid data/model-parallel split appears after this list).
Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z)
- Automatic Graph Partitioning for Very Large-scale Deep Learning [4.472135966077758]
This work proposes RaNNC (Rapid Neural Network Connector) for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in a device's memory.
RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models.
arXiv Detail & Related papers (2021-03-30T04:26:04Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism [3.4377970608678314]
We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks.
We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net.
arXiv Detail & Related papers (2020-07-25T05:06:06Z)
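Several of the papers above (SplitBrain, Galvatron, Alpa, and the hybrid-parallel 3D CNN work) combine data parallelism with model (tensor) parallelism. The single-process numpy sketch below, referenced from the SplitBrain entry, only illustrates what such a hybrid plan computes: the "devices" are simulated, and nothing here is the API of any of these systems.

```python
# Toy, single-process illustration of hybrid data/model parallelism.
# "Devices" are simulated with plain numpy arrays; real systems would use
# collective communication (all-gather / all-reduce) instead.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 16, 32
x = rng.normal(size=(batch, d_in))   # input activations
w = rng.normal(size=(d_in, d_out))   # a memory-demanding linear layer

# Data parallelism: split the batch across 2 simulated data-parallel replicas.
x_shards = np.split(x, 2, axis=0)

# Model (tensor) parallelism: shard the weight column-wise across 2 simulated
# devices, so each holds only half of the output features.
w_shards = np.split(w, 2, axis=1)

# Each (data replica, weight shard) pair computes a local partial output;
# concatenating over the model axis and then over the data axis recovers
# the full result of the serial computation.
out = np.concatenate([
    np.concatenate([xs @ ws for ws in w_shards], axis=1)  # "all-gather" over model axis
    for xs in x_shards
], axis=0)                                                 # concat over data axis

assert np.allclose(out, x @ w)  # the hybrid plan matches the serial result
```

Sharding the weight column-wise means each simulated device stores only half of the output features, while splitting the batch keeps per-replica activations small; a real system would replace the concatenations with all-gather collectives and the batch split with distributed data loading.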
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.