Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
Training
- URL: http://arxiv.org/abs/2110.14883v3
- Date: Thu, 5 Oct 2023 04:09:09 GMT
- Title: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
Training
- Authors: Shenggui Li and Hongxin Liu and Zhengda Bian and Jiarui Fang and
Haichen Huang and Yuliang Liu and Boxiang Wang and Yang You
- Abstract summary: Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
System supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
- Score: 23.633810934134065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of Transformer models has pushed the deep learning model scale to
billions of parameters. Due to the limited memory resource of a single GPU,
However, the best practice for choosing the optimal parallel strategy is still
lacking, since it requires domain expertise in both deep learning and parallel
computing.
The Colossal-AI system addressed the above challenge by introducing a unified
interface to scale your sequential code of model training to distributed
environments. It supports parallel training methods such as data, pipeline,
tensor, and sequence parallelism, as well as heterogeneous training methods
integrated with zero redundancy optimizer. Compared to the baseline system,
Colossal-AI can achieve up to 2.76 times training speedup on large-scale
models.
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPU and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
atom is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
atom aims to accommodate a complete LLM on one host (peer) through seamlessly model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, atom can enhance training efficiency up to $20 times$ when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale
Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs
with Hybrid Parallelism [3.4377970608678314]
We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks.
We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net.
arXiv Detail & Related papers (2020-07-25T05:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.