Does compressing activations help model parallel training?
- URL: http://arxiv.org/abs/2301.02654v1
- Date: Fri, 6 Jan 2023 18:58:09 GMT
- Title: Does compressing activations help model parallel training?
- Authors: Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram
Venkataraman
- Abstract summary: We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
- Score: 64.59298055364336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Transformer models are known for their exceptional performance in
a range of tasks, but training them can be difficult due to the requirement for
communication-intensive model parallelism. One way to improve training speed is
to compress the message size in communication. Previous approaches have
primarily focused on compressing gradients in a data parallelism setting, but
compression in a model-parallel setting is an understudied area. We have
discovered that model parallelism has fundamentally different characteristics
than data parallelism. In this work, we present the first empirical study on
the effectiveness of compression methods for model parallelism. We implement
and evaluate three common classes of compression algorithms - pruning-based,
learning-based, and quantization-based - using a popular Transformer training
framework. We evaluate these methods across more than 160 settings and 8
popular datasets, taking into account different hyperparameters, hardware, and
both fine-tuning and pre-training stages. We also analyze how these methods
behave when the model is scaled up. Finally, we offer insights for the future
development of model-parallelism compression algorithms.
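
The abstract names pruning-based and quantization-based compression of the messages exchanged in model-parallel training. The following is a minimal illustrative sketch (not the paper's implementation) of two such schemes applied to an activation tensor before it would be sent to the next model-parallel stage; it assumes PyTorch, and the function names and compression ratio are hypothetical.

```python
# Sketch of two compression classes from the abstract: pruning-based (top-k)
# and quantization-based (int8). Names and ratios are illustrative only.
import torch

def topk_compress(x: torch.Tensor, ratio: float = 0.1):
    """Pruning-based: keep only the largest-magnitude activation entries."""
    flat = x.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    # Send the surviving values and their indices instead of the full tensor.
    return flat[indices], indices, x.shape

def topk_decompress(values, indices, shape):
    flat = torch.zeros(torch.Size(shape).numel(), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

def int8_compress(x: torch.Tensor):
    """Quantization-based: linear scaling to 8-bit integers (~4x vs. fp32)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def int8_decompress(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    act = torch.randn(4, 1024)               # activations leaving one stage
    v, idx, shape = topk_compress(act, 0.1)  # ~10x fewer values on the wire
    q, s = int8_compress(act)
    print(topk_decompress(v, idx, shape).shape, int8_decompress(q, s).shape)
```

In a real model-parallel pipeline the compressed payload would be communicated (e.g. point-to-point between pipeline stages) and decompressed on the receiving device; the learning-based class evaluated in the paper instead trains a small autoencoder to produce the compressed representation.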
Related papers
- A Survey on Transformer Compression [84.18094368700379]
Transformer models play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression [8.591088380355252]
We present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression.
We propose techniques to avoid the model quality drop that comes from the compression.
We demonstrate our solution on a GPU cluster and achieve superior speedup over baseline state-of-the-art distributed training solutions without sacrificing model quality.
arXiv Detail & Related papers (2023-01-24T06:07:55Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)