SuperScaler: Supporting Flexible DNN Parallelization via a Unified
Abstraction
- URL: http://arxiv.org/abs/2301.08984v1
- Date: Sat, 21 Jan 2023 17:47:55 GMT
- Title: SuperScaler: Supporting Flexible DNN Parallelization via a Unified
Abstraction
- Authors: Zhiqi Lin, Youshan Miao, Guodong Liu, Xiaoxiang Shi, Quanlu Zhang, Fan
Yang, Saeed Maleki, Yi Zhu, Xu Cao, Cheng Li, Mao Yang, Lintao Zhang, Lidong
Zhou
- Abstract summary: SuperScaler is a system that facilitates the design and generation of flexible parallelization plans.
It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data dependency preserving.
As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup.
- Score: 17.82865339337427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing model size, deep neural networks (DNN) are increasingly
trained over massive GPU accelerators, which demands a proper parallelization
plan that transforms a DNN model into fine-grained tasks and then schedules
them to GPUs for execution. Due to the large search space, the contemporary
parallelization plan generators often rely on empirical rules that couple
transformation and scheduling, and fall short in exploring more flexible
schedules that yield better memory usage and compute efficiency. This tension
can be exacerbated by the emerging models with increasing complexity in their
structure and model size. SuperScaler is a system that facilitates the design
and generation of highly flexible parallelization plans. It explicitly formulates
plan design and generation as three sequential phases: model transformation,
space-time scheduling, and data dependency preserving. Such a
principled approach decouples multiple seemingly intertwined factors and
enables the composition of highly flexible parallelization plans. As a result,
SuperScaler can not only generate empirical parallelization plans, but also
construct new plans that achieve up to 3.5X speedup compared to
state-of-the-art solutions like DeepSpeed, Megatron and Alpa, for emerging DNN
models like Swin-Transformer and AlphaFold2, as well as well-optimized models
like GPT-3.
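The abstract names the three phases but not their concrete interfaces, so the following is only a minimal, hypothetical Python sketch of the idea: every class and function name below is made up for illustration and is not SuperScaler's actual API. It shows how operator splitting (model transformation), device/step assignment (space-time scheduling), and communication insertion (data dependency preserving) can be composed as independent steps rather than coupled empirical rules.

```python
# Hypothetical toy sketch (not SuperScaler's API): plan generation decomposed
# into the three phases named in the abstract.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # names of producer tasks
    device: int = -1                          # filled in by scheduling
    step: int = -1

def transform(ops, degree):
    """Phase 1: model transformation -- split every operator into `degree`
    fine-grained tasks (here same-index shards depend on each other; a real
    transform would also emit cross-shard reduction edges when needed)."""
    tasks = {}
    for op, dep_ops in ops:
        for i in range(degree):
            tasks[f"{op}/{i}"] = Task(f"{op}/{i}", [f"{d}/{i}" for d in dep_ops])
    return tasks

def schedule(tasks, num_devices):
    """Phase 2: space-time scheduling -- decide where (device) and when (step)
    each task runs, independently of how the model was transformed.
    Here: a naive contiguous placement; a real planner searches this space."""
    for step, task in enumerate(tasks.values()):
        task.device = (step * num_devices) // len(tasks)
        task.step = step
    return tasks

def preserve_dependencies(tasks):
    """Phase 3: data-dependency preserving -- add a communication op for every
    producer/consumer edge that ends up crossing a device boundary."""
    comms = []
    for task in tasks.values():
        for dep in task.deps:
            if tasks[dep].device != task.device:
                comms.append(f"send {dep}: dev{tasks[dep].device} -> dev{task.device}")
    return comms

# A two-operator chain, split 2-way and placed on 2 devices.
ops = [("matmul", []), ("gelu", ["matmul"])]
plan = schedule(transform(ops, degree=2), num_devices=2)
print(preserve_dependencies(plan))
# ['send matmul/0: dev0 -> dev1', 'send matmul/1: dev0 -> dev1']
```

Because each phase consumes only the output of the previous one, swapping in a different splitting rule or schedule does not require touching the dependency-preservation step, which is the kind of flexibility the abstract emphasizes.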
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Learning Versatile 3D Shape Generation with Improved AR Models [91.87115744375052]
Auto-regressive (AR) models have achieved impressive results in 2D image generation by modeling joint distributions in the grid space.
We propose the Improved Auto-regressive Model (ImAM) for 3D shape generation, which applies discrete representation learning based on a latent vector instead of volumetric grids.
arXiv Detail & Related papers (2023-03-26T12:03:18Z)
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism [25.928940638269534]
We propose Galvatron, a framework that automatically finds the most efficient hybrid parallelism strategy.
Galvatron consistently achieves superior system throughput compared to prior work that supports only limited parallelism.
arXiv Detail & Related papers (2022-11-25T03:45:31Z)
- On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
- SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high performance distributed deep learning framework supporting hybrid data and model parallelism.
Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers (a toy sketch of this kind of hybrid data/model-parallel split appears after this list).
Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z)
- Automatic Graph Partitioning for Very Large-scale Deep Learning [4.472135966077758]
This work proposes RaNNC (Rapid Neural Network Connector) for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in a device's memory.
RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models.
arXiv Detail & Related papers (2021-03-30T04:26:04Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism [3.4377970608678314]
We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks.
We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net.
arXiv Detail & Related papers (2020-07-25T05:06:06Z)
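Several of the papers above (SplitBrain, Galvatron, Alpa, and the hybrid-parallel 3D CNN work) combine data parallelism with model (tensor) parallelism. The single-process numpy sketch below, referenced from the SplitBrain entry, only illustrates what such a hybrid plan computes: the "devices" are simulated, and nothing here is the API of any of these systems.

```python
# Toy, single-process illustration of hybrid data/model parallelism.
# "Devices" are simulated with plain numpy arrays; real systems would use
# collective communication (all-gather / all-reduce) instead.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 16, 32
x = rng.normal(size=(batch, d_in))   # input activations
w = rng.normal(size=(d_in, d_out))   # a memory-demanding linear layer

# Data parallelism: split the batch across 2 simulated data-parallel replicas.
x_shards = np.split(x, 2, axis=0)

# Model (tensor) parallelism: shard the weight column-wise across 2 simulated
# devices, so each holds only half of the output features.
w_shards = np.split(w, 2, axis=1)

# Each (data replica, weight shard) pair computes a local partial output;
# concatenating over the model axis and then over the data axis recovers
# the full result of the serial computation.
out = np.concatenate([
    np.concatenate([xs @ ws for ws in w_shards], axis=1)  # "all-gather" over model axis
    for xs in x_shards
], axis=0)                                                 # concat over data axis

assert np.allclose(out, x @ w)  # the hybrid plan matches the serial result
```

Sharding the weight column-wise means each simulated device stores only half of the output features, while splitting the batch keeps per-replica activations small; a real system would replace the concatenations with all-gather collectives and the batch split with distributed data loading.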
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.