An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems
- URL: http://arxiv.org/abs/2101.10761v1
- Date: Tue, 26 Jan 2021 13:06:00 GMT
- Title: An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems
- Authors: Ahmed M. Abdelmoniem and Ahmed Elzanaty and Mohamed-Slim Alouini and
Marco Canini
- Abstract summary: Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
- Score: 77.88178159830905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent many-fold increase in the size of deep neural networks makes
efficient distributed training challenging. Many proposals exploit the
compressibility of the gradients and propose lossy compression techniques to
speed up the communication stage of distributed training. Nevertheless,
compression comes at the cost of reduced model quality and extra computation
overhead. In this work, we design an efficient compressor with minimal
overhead. Noting the sparsity of the gradients, we propose to model the
gradients as random variables distributed according to some sparsity-inducing
distributions (SIDs). We empirically validate our assumption by studying the
statistical characteristics of the evolution of gradient vectors over the
training process. We then propose Sparsity-Inducing Distribution-based
Compression (SIDCo), a threshold-based sparsification scheme that enjoys
similar threshold estimation quality to deep gradient compression (DGC) while
being faster by imposing lower compression overhead. Our extensive evaluation
of popular machine learning benchmarks involving both recurrent neural network
(RNN) and convolutional neural network (CNN) models shows that SIDCo speeds up
training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression
baseline, Top-k, and DGC compressors, respectively.
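To make the threshold-estimation idea concrete, the snippet below is a minimal, hypothetical single-stage sketch: gradient magnitudes are modeled with an exponential sparsity-inducing distribution fitted from the sample mean, and the threshold is read off the fitted tail so that roughly a target fraction of elements survives, with no sorting of the full gradient. The function name, single-distribution fit, and target ratio are illustrative assumptions; the paper's SIDCo scheme uses multi-stage fitting over several candidate SIDs.

```python
import numpy as np

def sid_threshold_sparsify(grad, target_ratio=0.001):
    """Single-stage threshold estimation sketch (not the paper's full scheme).

    Gradient magnitudes are modeled as exponentially distributed (a simple
    sparsity-inducing distribution). The rate is estimated from the sample
    mean, and the threshold is the value whose tail probability equals the
    target compression ratio.
    """
    flat = np.ravel(grad)
    mags = np.abs(flat)
    mean_mag = mags.mean()
    # For an exponential fit, P(|g| > t) = exp(-t / mean_mag); solve for t.
    threshold = -mean_mag * np.log(target_ratio)
    keep = np.nonzero(mags >= threshold)[0]
    return keep, flat[keep], threshold

# Example: sparsify a synthetic Laplace-like gradient to roughly 0.1% density.
rng = np.random.default_rng(0)
g = rng.laplace(scale=1e-3, size=1_000_000)
idx, vals, thr = sid_threshold_sparsify(g, target_ratio=0.001)
print(f"kept {idx.size} / {g.size} elements at threshold {thr:.2e}")
```

Because the threshold comes from a closed-form quantile of the fitted distribution, the selection cost is linear in the gradient size, which is the kind of overhead advantage over sort-based Top-k selection that the abstract refers to.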
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Accelerating Distributed Deep Learning using Lossless Homomorphic Compression [17.654138014999326]
We introduce a novel compression algorithm that effectively merges worker-level compression with in-network aggregation.
We show up to a 6.33× improvement in aggregation throughput and a 3.74× increase in per-iteration training speed.
arXiv Detail & Related papers (2024-02-12T09:57:47Z)
- L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning [24.712888488317816]
We provide a framework for adapting the degree of compression across the model's layers dynamically during training.
Our framework, called L-GreCo, is based on an adaptive algorithm, which automatically picks the optimal compression parameters for model layers.
arXiv Detail & Related papers (2022-10-31T14:37:41Z)
- Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z) - On the Utility of Gradient Compression in Distributed Training Systems [9.017890174185872]
We evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD.
Surprisingly, we observe that due to computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative.
arXiv Detail & Related papers (2021-02-28T15:58:45Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power steps to maximize the information transferred per bit; a minimal power-step sketch is given after this list.
arXiv Detail & Related papers (2020-08-04T09:14:52Z) - Structured Sparsification with Joint Optimization of Group Convolution
and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)
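As promised in the PowerGossip entry above, here is a minimal, hypothetical sketch of rank-1 compression of a difference matrix via one power-iteration step, in the spirit of PowerSGD/PowerGossip but not the papers' exact protocols: the gossip exchange, error feedback, and higher-rank variants are omitted, and the function name and synthetic example are assumptions for illustration.

```python
import numpy as np

def power_step_compress(diff, q_prev=None, rng=None):
    """One power-iteration step on an m-by-n matrix of model differences.

    Instead of exchanging all m*n entries, workers would exchange only the
    two factors p (length m) and q (length n); diff is approximated by
    outer(p, q). Reusing the previous right factor as a warm start lets
    repeated steps track the dominant singular direction.
    """
    if q_prev is None:
        rng = rng or np.random.default_rng()
        q_prev = rng.normal(size=diff.shape[1])
    p = diff @ q_prev
    p /= np.linalg.norm(p) + 1e-12   # orthonormal left factor
    q = diff.T @ p                   # right factor carries the magnitude
    return p, q

# Example: a synthetic, nearly rank-1 difference between two workers' weights.
rng = np.random.default_rng(1)
u, v = rng.normal(size=256), rng.normal(size=128)
diff = np.outer(u, v) + 0.01 * rng.normal(size=(256, 128))
p, q = power_step_compress(diff, rng=rng)
approx = np.outer(p, q)
print("relative error:", np.linalg.norm(diff - approx) / np.linalg.norm(diff))
print("floats sent:", p.size + q.size, "instead of", diff.size)
```

Under this reading of the abstract, neighboring workers exchange only the low-rank factors of their model differences, which is why the per-step communication scales with the sum, rather than the product, of the layer dimensions.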