Fine-tuning Language Models over Slow Networks using Activation
Compression with Guarantees
- URL: http://arxiv.org/abs/2206.01299v1
- Date: Thu, 2 Jun 2022 20:49:12 GMT
- Title: Fine-tuning Language Models over Slow Networks using Activation
Compression with Guarantees
- Authors: Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen,
Christopher Re, Ce Zhang
- Abstract summary: We show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end compression".
AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality.
- Score: 33.38465345409054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication compression is a crucial technique for modern distributed
learning systems to alleviate their communication bottlenecks over slower
networks. Despite recent intensive studies of gradient compression for data
parallel-style training, compressing the activations for models trained with
pipeline parallelism is still an open problem. In this paper, we propose
AC-SGD, a novel activation compression algorithm for communication-efficient
pipeline parallelism training over slow networks. Different from previous
efforts in activation compression, instead of compressing activation values
directly, AC-SGD compresses the changes of the activations. This allows us to
show, to the best of our knowledge for the first time, that one can still
achieve $O(1/\sqrt{T})$ convergence rate for non-convex objectives under
activation compression, without making assumptions on gradient unbiasedness
that do not hold for deep learning models with non-linear activation
functions. We then show that AC-SGD can be optimized and implemented
efficiently, without additional end-to-end runtime overhead. We evaluated
AC-SGD to fine-tune language models with up to 1.5 billion parameters,
compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end
speed-up in slower networks, without sacrificing model quality. Moreover, we
also show that AC-SGD can be combined with state-of-the-art gradient
compression algorithms to enable "end-to-end communication compression": all
communications between machines, including model gradients, forward
activations, and backward gradients, are compressed into lower precision. This
provides up to 4.9X end-to-end speed-up, without sacrificing model quality.
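The core idea described above (quantizing the change of each activation rather than the activation itself, so that only low-bit deltas cross the slow link) can be illustrated with a minimal sketch. This is not the paper's implementation: the uniform 4-bit quantizer, the DeltaCompressor class, and all names below are illustrative assumptions.

```python
import numpy as np


def quantize(x, bits=4):
    """Uniform quantization to 2**bits levels; returns integer codes plus (lo, scale)."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return codes.astype(np.float32) * scale + lo


class DeltaCompressor:
    """One instance lives on each side of a pipeline-stage boundary; both track
    the same running reference, so only low-bit deltas need to be communicated."""

    def __init__(self, shape, bits=4):
        self.bits = bits
        self.reference = np.zeros(shape, dtype=np.float32)

    def encode(self, activation):
        delta = activation - self.reference
        codes, lo, scale = quantize(delta, self.bits)
        # Apply the *decoded* delta locally so sender and receiver stay in sync.
        self.reference += dequantize(codes, lo, scale)
        return codes, lo, scale

    def decode(self, codes, lo, scale):
        self.reference += dequantize(codes, lo, scale)
        return self.reference.copy()


# Usage: the sender ships (codes, lo, scale) instead of a full fp32 activation.
sender = DeltaCompressor(shape=(8, 16), bits=4)
receiver = DeltaCompressor(shape=(8, 16), bits=4)
for step in range(3):
    activation = np.random.randn(8, 16).astype(np.float32)
    payload = sender.encode(activation)      # 4-bit codes plus two floats
    recovered = receiver.decode(*payload)    # approximation of `activation`
```

The shared, synchronously updated reference is what lets both sides agree on the reconstructed activation while only 2-4 bit deltas travel over the network; how the actual algorithm quantizes and schedules these deltas is specified in the paper itself.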
Related papers
- Accelerating Large Language Model Training with Hybrid GPU-based Compression [3.204387803072905]
MPI libraries have been proven to reduce message size significantly and leverage interconnect bandwidth.
We investigate the efficacy of compression-assisted MPI collectives under the context of distributed Large Language Model (LLM) training.
arXiv Detail & Related papers (2024-09-04T04:05:30Z) - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs)
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z) - Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [10.233937665979694]
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications.
A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices.
We introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training.
arXiv Detail & Related papers (2024-07-05T05:55:18Z) - Communication-Efficient Distributed Learning with Local Immediate Error
Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - GraVAC: Adaptive Compression for Communication-Efficient Distributed DL
Training [0.0]
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model.
GraVAC is a framework that dynamically adjusts the compression factor throughout training by evaluating model progress and assessing the information loss associated with compression.
As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16, and LSTM by 4.32x, 1.95x, and 6.67x, respectively.
arXiv Detail & Related papers (2023-05-20T14:25:17Z) - L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and
Accurate Deep Learning [24.712888488317816]
We provide a framework for adapting the degree of compression across the model's layers dynamically during training.
Our framework, called L-GreCo, is based on an adaptive algorithm, which automatically picks the optimal compression parameters for model layers.
arXiv Detail & Related papers (2022-10-31T14:37:41Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)