Sparse Communication for Training Deep Networks
- URL: http://arxiv.org/abs/2009.09271v1
- Date: Sat, 19 Sep 2020 17:28:11 GMT
- Title: Sparse Communication for Training Deep Networks
- Authors: Negar Foroutan Eghlidi and Martin Jaggi
- Abstract summary: Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
- Score: 56.441077560085475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synchronous stochastic gradient descent (SGD) is the most common method used
for distributed training of deep learning models. In this algorithm, each
worker shares its local gradients with others and updates the parameters using
the average gradients of all workers. Although distributed training reduces the
computation time, the communication overhead associated with the gradient
exchange forms a scalability bottleneck for the algorithm. Many compression
techniques have been proposed to reduce the number of gradients that need to
be communicated. However, compressing the gradients introduces yet another
overhead to the problem. In this work, we study several compression schemes and
identify how three key parameters affect the performance. We also provide a set
of insights on how to increase performance and introduce a simple
sparsification scheme, random-block sparsification, that reduces communication
while keeping the performance close to standard SGD.
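The abstract names random-block sparsification but does not spell out the selection rule here, so the snippet below is only a minimal sketch of one plausible reading: each worker keeps a single contiguous block of the flattened gradient, with the block offset derived from a seed shared across workers so that only the block values (and the iteration counter) need to be communicated. The function name, block size, and seeding scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def random_block_sparsify(grad, block_size, step, seed=0):
    """Keep one contiguous block of the flattened gradient, zero the rest.

    Hedged sketch: the block offset is drawn from a generator seeded by the
    iteration number, so every worker picks the same block and only
    `block_size` values per tensor have to be exchanged. This selection rule
    is an assumption for illustration, not the paper's exact scheme.
    """
    flat = grad.ravel()
    rng = np.random.default_rng(seed + step)  # identical choice on every worker
    start = int(rng.integers(0, flat.size - block_size + 1))
    out = np.zeros_like(flat)
    out[start:start + block_size] = flat[start:start + block_size]
    return out.reshape(grad.shape)

# Simulated synchronous step with 4 workers: average the sparsified gradients.
# In a real data-parallel job this average would be computed with an all-reduce.
rng = np.random.default_rng(42)
worker_grads = [rng.normal(size=10_000) for _ in range(4)]
update = np.mean(
    [random_block_sparsify(g, block_size=512, step=0) for g in worker_grads],
    axis=0,
)
```

Because the block choice costs only a seeded random draw and a slice, a scheme of this shape adds very little compression overhead on top of the reduced communication, which is the trade-off the abstract emphasizes.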
Related papers
- Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods [17.006352664497122]
Modern deep neural networks often require distributed training with many workers due to their large size.
As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch gradient methods with per-iteration gradient synchronization.
We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance.
arXiv Detail & Related papers (2024-06-20T02:08:50Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves performance almost identical to the case with no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Learned Gradient Compression for Distributed Deep Learning [16.892546958602303]
Training deep neural networks on large datasets containing high-dimensional data requires a large amount of computation.
A solution to this problem is data-parallel distributed training, where a model is replicated into several computational nodes that have access to different chunks of the data.
This approach, however, entails high communication rates and latency because of the computed gradients that need to be shared among nodes at every iteration.
arXiv Detail & Related papers (2021-03-16T06:42:36Z) - Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while they wait for gradient synchronization.
DaSGD parallelizes SGD and forward/back propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator (a hedged sketch appears after this list).
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
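The random-top-k operator in the last entry above is named but not defined in this summary, so the following is a hedged sketch of one plausible reading, assuming the operator samples a random subset of coordinates and then keeps the k largest-magnitude entries within that subset; the exact rule is given in the cited paper and may differ.

```python
import numpy as np

def random_top_k(grad, k, subset_size, rng):
    """Hedged sketch of a 'random-top-k' sparsifier.

    Assumed rule (for illustration only): sample `subset_size` coordinates
    uniformly at random, keep the k largest-magnitude entries among them,
    and zero everything else.
    """
    flat = grad.ravel()
    subset = rng.choice(flat.size, size=subset_size, replace=False)
    top_within = subset[np.argsort(np.abs(flat[subset]))[-k:]]
    out = np.zeros_like(flat)
    out[top_within] = flat[top_within]
    return out.reshape(grad.shape)

# Example: keep 100 of 10,000 gradient entries, drawn from a random 1,000-entry subset.
rng = np.random.default_rng(0)
g = rng.normal(size=10_000)
sparse_g = random_top_k(g, k=100, subset_size=1_000, rng=rng)
```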