Related papers: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

URL: http://arxiv.org/abs/2110.14883v3
Date: Thu, 5 Oct 2023 04:09:09 GMT
Title: Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Authors: Shenggui Li and Hongxin Liu and Zhengda Bian and Jiarui Fang and Haichen Huang and Yuliang Liu and Boxiang Wang and Yang You
Abstract summary: Colossal-AI can achieve up to 2.76 times training speedup on large-scale models. The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
Score: 23.633810934134065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The success of Transformer models has pushed the deep learning model scale to billions of parameters, beyond the limited memory of a single GPU. However, the best practice for choosing the optimal parallel strategy is still lacking, since it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addresses this challenge by introducing a unified interface that scales sequential model-training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, as well as heterogeneous training methods integrated with the zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.

Related papers

AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs [24.96730768606278]
We present AReaL-Hex, a heterogeneous-aware asynchronous RL training system. It effectively schedules how to execute rollout generation and policy model training over heterogeneous GPUs. It delivers up to 1.50x higher training throughput and 1.46x reduction in training cost.
arXiv Detail & Related papers (2025-11-02T04:17:30Z)
Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training [9.47829333855806]
In deep learning recommendation models (DLRM), the sparse embedding table is a crucial component for managing sparse categorical features. We propose a novel two-dimensional sparse parallelism approach to overcome scalability challenges. We show that the proposed approach significantly enhances training efficiency while maintaining model performance parity.
arXiv Detail & Related papers (2025-08-05T19:12:18Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training large models. This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments. This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting. ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput. Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency up to 20x compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms. We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive. We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism [3.4377970608678314]
We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks. We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net.
arXiv Detail & Related papers (2020-07-25T05:06:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.