Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware
Communication Compression
- URL: http://arxiv.org/abs/2301.09830v1
- Date: Tue, 24 Jan 2023 06:07:55 GMT
- Title: Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware
Communication Compression
- Authors: Jaeyong Song, Jinkyu Yim, Jaewon Jung, Hongsun Jang, Hyung-Jin Kim,
Youngsok Kim, Jinho Lee
- Abstract summary: We present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression.
We propose techniques to avoid the model quality drop that comes from the compression.
We demonstrate our solution on a GPU cluster, and achieve superior speedup from the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.
- Score: 8.591088380355252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In training modern large natural language processing (NLP) models, it has
become common practice to split models across multiple GPUs using 3D parallelism.
This technique, however, suffers from high inter-node communication overhead.
Compressing the communication is one way to mitigate the overhead by reducing
the inter-node traffic volume; however, existing compression techniques have
critical limitations when applied to NLP models with 3D parallelism: 1) they
target only the data-parallel traffic, and 2) they already degrade the model
quality too much.
In this paper, we present Optimus-CC, a fast and scalable distributed
training framework for large NLP models with aggressive communication
compression. Optimus-CC differs from existing communication compression
frameworks in the following ways: First, we compress pipeline-parallel
(inter-stage) traffic. Specifically, we compress the inter-stage backpropagation
traffic and the embedding synchronization traffic, in addition to applying existing
data-parallel traffic compression methods. Second, we propose techniques to avoid the model
quality drop that comes from the compression. We further provide mathematical
and empirical analyses to show that our techniques can successfully suppress
the compression error. Lastly, we analyze the pipeline and opt to selectively
compress the traffic lying on the critical path, which further helps reduce
the compression error. We demonstrate our solution on a GPU cluster and
achieve superior speedup over baseline state-of-the-art solutions for
distributed training without sacrificing model quality.
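As a concrete illustration of the inter-stage compression idea, the sketch below applies top-k sparsification with error feedback to the activation gradients exchanged between pipeline stages during backpropagation. It is only a minimal sketch under assumed choices (the top-k scheme, the ErrorFeedbackTopK name, and the compression ratio are illustrative), not the paper's actual algorithm.

```python
# Minimal sketch (not the Optimus-CC implementation): top-k sparsification with
# error feedback applied to the activation gradients sent between pipeline
# stages during backpropagation. Names and the choice of top-k are assumptions.
import torch


class ErrorFeedbackTopK:
    """Keep only the k largest-magnitude entries of a tensor and carry the
    discarded residual into the next call, so the compression error is re-sent
    later instead of being lost (a standard error-feedback scheme)."""

    def __init__(self, k_ratio: float = 0.01):
        self.k_ratio = k_ratio
        self.residual = None  # locally accumulated compression error

    def compress(self, grad: torch.Tensor):
        flat = grad.flatten()
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        corrected = flat + self.residual              # add back previous error
        k = max(1, int(self.k_ratio * corrected.numel()))
        _, indices = torch.topk(corrected.abs(), k)
        values = corrected[indices]                   # signed top-k values
        self.residual = corrected.clone()             # remember what was dropped
        self.residual[indices] = 0.0
        return values, indices, grad.shape            # payload sent over the wire

    @staticmethod
    def decompress(values, indices, shape):
        flat = torch.zeros(shape, dtype=values.dtype).flatten()
        flat[indices] = values
        return flat.reshape(shape)


# Usage: the sending stage compresses the gradient of its input activations;
# the previous stage decompresses it and continues backpropagation.
compressor = ErrorFeedbackTopK(k_ratio=0.01)
grad = torch.randn(4, 1024)                           # stand-in activation gradient
payload = compressor.compress(grad)
restored = ErrorFeedbackTopK.decompress(*payload)
```

Error feedback of this kind is one standard way to keep repeated compression from eroding model quality; the paper provides its own mathematical and empirical analyses of how its techniques suppress the compression error.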
Related papers
- Fast Feedforward 3D Gaussian Splatting Compression [55.149325473447384]
FCGS is an optimization-free model that can compress 3D Gaussian Splatting (3DGS) representations rapidly in a single feed-forward pass.
FCGS achieves a compression ratio of over 20X while maintaining fidelity, surpassing most per-scene SOTA optimization-based methods.
arXiv Detail & Related papers (2024-10-10T15:13:08Z)
- Accelerating Large Language Model Training with Hybrid GPU-based Compression [3.204387803072905]
Compression-assisted MPI libraries have been shown to significantly reduce message size and better leverage interconnect bandwidth.
We investigate the efficacy of compression-assisted MPI collectives under the context of distributed Large Language Model (LLM) training.
arXiv Detail & Related papers (2024-09-04T04:05:30Z)
- Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z)
- Unified Low-rank Compression Framework for Click-through Rate Prediction [15.813889566241539]
We propose a unified low-rank decomposition framework for compressing CTR prediction models.
Models compressed with our framework can achieve better performance than the original models.
Our framework can be applied to embedding tables and layers in various CTR prediction models.
arXiv Detail & Related papers (2024-05-28T13:06:32Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training [0.0]
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model.
GraVAC is a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing information loss associated with compression.
Compared to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16, and LSTM by 4.32x, 1.95x, and 6.67x, respectively.
arXiv Detail & Related papers (2023-05-20T14:25:17Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
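For reference, the low-rank "power step" idea behind PowerSGD and PowerGossip can be sketched as below. The rank, the warm-started query matrix, and the function names are illustrative assumptions rather than the papers' exact implementations.

```python
# Illustrative sketch of the PowerSGD / PowerGossip low-rank "power step":
# a large matrix (e.g., a gradient or a model difference between neighboring
# workers) is approximated by two thin rank-r factors, so only those factors
# need to be communicated. Names and the single power iteration are assumptions.
import torch


def power_compress(M: torch.Tensor, Q: torch.Tensor):
    """One power-iteration step. M is (n, m); Q is a reused (m, r) query matrix.
    Returns thin factors P (n, r) and Q (m, r) with P @ Q.T approximating M."""
    P = M @ Q                        # (n, r) power step
    P, _ = torch.linalg.qr(P)        # orthogonalize for numerical stability
    Q = M.T @ P                      # (m, r) second factor
    return P, Q


def power_decompress(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    return P @ Q.T                   # rank-r reconstruction


# Usage: approximate a 1024x1024 matrix at rank 4, i.e. communicate roughly
# 2 * 1024 * 4 numbers instead of 1024 * 1024.
n, m, r = 1024, 1024, 4
M = torch.randn(n, m)
Q = torch.randn(m, r)                # warm-started across iterations in practice
P, Q = power_compress(M, Q)
M_approx = power_decompress(P, Q)
```

Communicating only the thin factors P and Q instead of the full matrix is what lets these methods maximize the information transferred per bit.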