Sparse Communication for Training Deep Networks
- URL: http://arxiv.org/abs/2009.09271v1
- Date: Sat, 19 Sep 2020 17:28:11 GMT
- Title: Sparse Communication for Training Deep Networks
- Authors: Negar Foroutan Eghlidi and Martin Jaggi
- Abstract summary: Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
- Score: 56.441077560085475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synchronous stochastic gradient descent (SGD) is the most common method used
for distributed training of deep learning models. In this algorithm, each
worker shares its local gradients with others and updates the parameters using
the average gradients of all workers. Although distributed training reduces the
computation time, the communication overhead associated with the gradient
exchange forms a scalability bottleneck for the algorithm. Many compression
techniques have been proposed to reduce the number of gradients that need to
be communicated. However, compressing the gradients introduces yet another
overhead to the problem. In this work, we study several compression schemes and
identify how three key parameters affect the performance. We also provide a set
of insights on how to increase performance and introduce a simple
sparsification scheme, random-block sparsification, that reduces communication
while keeping the performance close to standard SGD.
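The abstract names random-block sparsification but does not spell out the selection rule here, so the snippet below is only a minimal sketch of one plausible reading: each worker keeps a single contiguous block of the flattened gradient, with the block offset derived from a seed shared across workers so that only the block values (and the iteration counter) need to be communicated. The function name, block size, and seeding scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def random_block_sparsify(grad, block_size, step, seed=0):
    """Keep one contiguous block of the flattened gradient, zero the rest.

    Hedged sketch: the block offset is drawn from a generator seeded by the
    iteration number, so every worker picks the same block and only
    `block_size` values per tensor have to be exchanged. This selection rule
    is an assumption for illustration, not the paper's exact scheme.
    """
    flat = grad.ravel()
    rng = np.random.default_rng(seed + step)  # identical choice on every worker
    start = int(rng.integers(0, flat.size - block_size + 1))
    out = np.zeros_like(flat)
    out[start:start + block_size] = flat[start:start + block_size]
    return out.reshape(grad.shape)

# Simulated synchronous step with 4 workers: average the sparsified gradients.
# In a real data-parallel job this average would be computed with an all-reduce.
rng = np.random.default_rng(42)
worker_grads = [rng.normal(size=10_000) for _ in range(4)]
update = np.mean(
    [random_block_sparsify(g, block_size=512, step=0) for g in worker_grads],
    axis=0,
)
```

Because the block choice costs only a seeded random draw and a slice, a scheme of this shape adds very little compression overhead on top of the reduced communication, which is the trade-off the abstract emphasizes.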
Related papers
- Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods [17.006352664497122]
Modern deep neural networks often require distributed training with many workers due to their large size.
As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch gradient methods with per-iteration gradient synchronization.
We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance.
arXiv Detail & Related papers (2024-06-20T02:08:50Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves performance almost identical to the case with no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Learned Gradient Compression for Distributed Deep Learning [16.892546958602303]
Training deep neural networks on large datasets containing high-dimensional data requires a large amount of computation.
A solution to this problem is data-parallel distributed training, where a model is replicated into several computational nodes that have access to different chunks of the data.
This approach, however, entails high communication rates and latency because of the computed gradients that need to be shared among nodes at every iteration.
arXiv Detail & Related papers (2021-03-16T06:42:36Z) - Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while they wait for gradient synchronization.
DaSGD parallelizes SGD and forward/back propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator (a hedged sketch appears after this list).
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
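The random-top-k operator in the last entry above is named but not defined in this summary, so the following is a hedged sketch of one plausible reading, assuming the operator samples a random subset of coordinates and then keeps the k largest-magnitude entries within that subset; the exact rule is given in the cited paper and may differ.

```python
import numpy as np

def random_top_k(grad, k, subset_size, rng):
    """Hedged sketch of a 'random-top-k' sparsifier.

    Assumed rule (for illustration only): sample `subset_size` coordinates
    uniformly at random, keep the k largest-magnitude entries among them,
    and zero everything else.
    """
    flat = grad.ravel()
    subset = rng.choice(flat.size, size=subset_size, replace=False)
    top_within = subset[np.argsort(np.abs(flat[subset]))[-k:]]
    out = np.zeros_like(flat)
    out[top_within] = flat[top_within]
    return out.reshape(grad.shape)

# Example: keep 100 of 10,000 gradient entries, drawn from a random 1,000-entry subset.
rng = np.random.default_rng(0)
g = rng.normal(size=10_000)
sparse_g = random_top_k(g, k=100, subset_size=1_000, rng=rng)
```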