S2 Reducer: High-Performance Sparse Communication to Accelerate
Distributed Deep Learning
- URL: http://arxiv.org/abs/2110.02140v1
- Date: Tue, 5 Oct 2021 16:14:40 GMT
- Title: S2 Reducer: High-Performance Sparse Communication to Accelerate
Distributed Deep Learning
- Authors: Keshi Ge, Yongquan Fu, Zhiquan Lai, Xiaoge Deng, Dongsheng Li
- Abstract summary: We propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees.
S2 Reducer reduces the communication cost by compressing only the non-zero gradients with a count-sketch and a bitmap.
Our results show that S2 Reducer converges to the same accuracy, reduces sparse communication overhead by 81%, and achieves a 1.8× speedup compared to state-of-the-art approaches.
- Score: 11.21739015522637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The distributed stochastic gradient descent (SGD) approach has been
widely used in large-scale deep learning, and the gradient collective method is
vital to the training scalability of a distributed deep learning system.
Collective communication such as AllReduce has been widely adopted in the
distributed SGD process to reduce communication time. However, AllReduce
consumes substantial bandwidth even though gradients are sparse in many cases:
many gradient values are zeros and should be efficiently compressed to save
bandwidth. To reduce this sparse gradient communication overhead, we propose
Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient
aggregation method with convergence guarantees. S2 Reducer reduces the
communication cost by compressing only the non-zero gradients with a
count-sketch and a bitmap, and enables efficient AllReduce operators for
parallel SGD training. We perform an extensive evaluation against four
state-of-the-art methods over five training models. Our results show that S2
Reducer converges to the same accuracy, reduces sparse communication overhead
by 81%, and achieves a 1.8× speedup compared to state-of-the-art approaches.
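To make the abstract's core idea more concrete, the snippet below is a minimal,
hypothetical sketch of compressing only the non-zero gradient entries with a
count-sketch plus a bitmap. The names (CountSketch, compress, decompress, width,
depth) are illustrative assumptions, not the paper's actual implementation;
error feedback, the AllReduce integration, and the convergence machinery
described in the paper are omitted.

import numpy as np

class CountSketch:
    """Toy count-sketch: a depth x width table with per-coordinate bucket and sign hashes."""
    def __init__(self, width, depth, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        # Shared "hash functions": every worker derives the same buckets/signs
        # from a common seed.
        self.buckets = rng.integers(0, width, size=(depth, dim))
        self.signs = rng.choice([-1.0, 1.0], size=(depth, dim))
        self.table = np.zeros((depth, width))

    def insert(self, idx, vals):
        # Scatter-add signed values into each row's buckets.
        for r in range(self.depth):
            np.add.at(self.table[r], self.buckets[r, idx], self.signs[r, idx] * vals)

    def estimate(self, idx):
        # Median across rows of the signed bucket contents = count-sketch estimate.
        rows = [self.signs[r, idx] * self.table[r, self.buckets[r, idx]]
                for r in range(self.depth)]
        return np.median(np.stack(rows), axis=0)

def compress(grad, width=256, depth=3):
    # The bitmap marks which coordinates are non-zero; only those enter the sketch.
    bitmap = grad != 0
    nz = np.flatnonzero(bitmap)
    sketch = CountSketch(width, depth, grad.size)
    sketch.insert(nz, grad[nz])
    return np.packbits(bitmap), sketch

def decompress(packed_bitmap, sketch, dim):
    # Recover an approximate dense gradient: zero entries stay exact,
    # non-zero entries are estimated from the sketch.
    bitmap = np.unpackbits(packed_bitmap)[:dim].astype(bool)
    out = np.zeros(dim)
    nz = np.flatnonzero(bitmap)
    out[nz] = sketch.estimate(nz)
    return out

# Toy round-trip on a 1%-dense gradient vector.
rng = np.random.default_rng(1)
g = np.zeros(10_000)
g[rng.choice(10_000, 100, replace=False)] = rng.normal(size=100)
packed, sk = compress(g)
g_hat = decompress(packed, sk, g.size)
print("max abs error:", np.abs(g - g_hat).max())

In this toy setup each worker would transmit only the packed bitmap and the
small sketch table. Count-sketch tables are linear, so they can in principle be
summed with AllReduce, which is consistent with (but not a confirmation of) the
AllReduce-friendly design the abstract mentions.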
Related papers
- Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods [17.006352664497122]
Modern deep neural networks often require distributed training with many workers due to their large size.
As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch gradient methods with per-iteration gradient synchronization.
We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance.
arXiv Detail & Related papers (2024-06-20T02:08:50Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- RS-DGC: Exploring Neighborhood Statistics for Dynamic Gradient Compression on Remote Sensing Image Interpretation [23.649838489244917]
Gradient sparsification has been validated as an effective gradient compression (GC) technique for reducing communication costs.
We propose RS-DGC, a simple yet effective dynamic gradient compression scheme for remote sensing (RS) image interpretation that leverages a neighborhood statistics indicator.
We achieve an accuracy improvement of 0.51% with more than 50 times communication compression on the NWPU-RESISC45 dataset.
arXiv Detail & Related papers (2023-12-29T09:24:26Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z)
- DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning [79.89085533866071]
This paper introduces DeepReduce, a versatile framework for the compressed communication of sparse tensors.
DeepReduce decomposes tensors into two sets, values and indices, and allows both independent and combined compression of these sets.
Our experiments with large real models demonstrate that DeepReduce transmits fewer data and imposes lower computational overhead than existing methods.
arXiv Detail & Related papers (2021-02-05T11:31:24Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance. (A generic top-k sparsification sketch illustrating this style of compression appears after this list.)
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
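As a generic illustration of the pattern several entries above describe — a
gradient split into values and indices, as in DeepReduce, and workers' sparse
contributions averaged, as in the sparse-communication study above — the
following is a minimal, hypothetical top-k sketch in NumPy. The names
(topk_compress, aggregate) and parameters are assumptions for illustration and
do not reproduce any of the listed papers' implementations; error feedback,
quantization, and real communication are omitted.

import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries as (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def aggregate(compressed, dim, num_workers):
    """Average the workers' sparse contributions into a dense gradient."""
    out = np.zeros(dim)
    for idx, vals in compressed:
        np.add.at(out, idx, vals)   # scatter-add handles colliding indices
    return out / num_workers

# Toy usage: 4 workers, 1000-dimensional gradients, keep 1% of entries each.
rng = np.random.default_rng(0)
dim, k, workers = 1000, 10, 4
grads = [rng.normal(size=dim) for _ in range(workers)]
compressed = [topk_compress(g, k) for g in grads]
avg_sparse = aggregate(compressed, dim, workers)
print(np.count_nonzero(avg_sparse), "non-zero entries after aggregation")

Each worker here ships only k index/value pairs instead of the dense gradient;
the methods listed above differ mainly in how they pick, encode, and correct
these sparse contributions.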
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.