S2 Reducer: High-Performance Sparse Communication to Accelerate
Distributed Deep Learning
- URL: http://arxiv.org/abs/2110.02140v1
- Date: Tue, 5 Oct 2021 16:14:40 GMT
- Title: S2 Reducer: High-Performance Sparse Communication to Accelerate
Distributed Deep Learning
- Authors: Keshi Ge, Yongquan Fu, Zhiquan Lai, Xiaoge Deng, Dongsheng Li
- Abstract summary: We propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees.
S2 Reducer reduces the communication cost by compressing only the non-zero gradients with a count-sketch and a bitmap.
Our results show that S2 Reducer converges to the same accuracy, reduces sparse communication overhead by 81%, and achieves a 1.8× speedup compared to state-of-the-art approaches.
- Score: 11.21739015522637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The distributed stochastic gradient descent (SGD) approach has been
widely used in large-scale deep learning, and the gradient collective method is
vital to the training scalability of a distributed deep learning system.
Collective communication such as AllReduce has been widely adopted in the
distributed SGD process to reduce communication time. However, AllReduce
consumes substantial bandwidth even though gradients are sparse in many cases:
many gradient values are zeros and should be efficiently compressed to save
bandwidth. To reduce this sparse gradient communication overhead, we propose
Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient
aggregation method with convergence guarantees. S2 Reducer reduces the
communication cost by compressing only the non-zero gradients with a
count-sketch and a bitmap, and enables efficient AllReduce operators for
parallel SGD training. We perform an extensive evaluation against four
state-of-the-art methods over five training models. Our results show that S2
Reducer converges to the same accuracy, reduces sparse communication overhead
by 81%, and achieves a 1.8× speedup compared to state-of-the-art approaches.
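To make the abstract's core idea more concrete, the snippet below is a minimal,
hypothetical sketch of compressing only the non-zero gradient entries with a
count-sketch plus a bitmap. The names (CountSketch, compress, decompress, width,
depth) are illustrative assumptions, not the paper's actual implementation;
error feedback, the AllReduce integration, and the convergence machinery
described in the paper are omitted.

import numpy as np

class CountSketch:
    """Toy count-sketch: a depth x width table with per-coordinate bucket and sign hashes."""
    def __init__(self, width, depth, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        # Shared "hash functions": every worker derives the same buckets/signs
        # from a common seed.
        self.buckets = rng.integers(0, width, size=(depth, dim))
        self.signs = rng.choice([-1.0, 1.0], size=(depth, dim))
        self.table = np.zeros((depth, width))

    def insert(self, idx, vals):
        # Scatter-add signed values into each row's buckets.
        for r in range(self.depth):
            np.add.at(self.table[r], self.buckets[r, idx], self.signs[r, idx] * vals)

    def estimate(self, idx):
        # Median across rows of the signed bucket contents = count-sketch estimate.
        rows = [self.signs[r, idx] * self.table[r, self.buckets[r, idx]]
                for r in range(self.depth)]
        return np.median(np.stack(rows), axis=0)

def compress(grad, width=256, depth=3):
    # The bitmap marks which coordinates are non-zero; only those enter the sketch.
    bitmap = grad != 0
    nz = np.flatnonzero(bitmap)
    sketch = CountSketch(width, depth, grad.size)
    sketch.insert(nz, grad[nz])
    return np.packbits(bitmap), sketch

def decompress(packed_bitmap, sketch, dim):
    # Recover an approximate dense gradient: zero entries stay exact,
    # non-zero entries are estimated from the sketch.
    bitmap = np.unpackbits(packed_bitmap)[:dim].astype(bool)
    out = np.zeros(dim)
    nz = np.flatnonzero(bitmap)
    out[nz] = sketch.estimate(nz)
    return out

# Toy round-trip on a 1%-dense gradient vector.
rng = np.random.default_rng(1)
g = np.zeros(10_000)
g[rng.choice(10_000, 100, replace=False)] = rng.normal(size=100)
packed, sk = compress(g)
g_hat = decompress(packed, sk, g.size)
print("max abs error:", np.abs(g - g_hat).max())

In this toy setup each worker would transmit only the packed bitmap and the
small sketch table. Count-sketch tables are linear, so they can in principle be
summed with AllReduce, which is consistent with (but not a confirmation of) the
AllReduce-friendly design the abstract mentions.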
Related papers
- Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods [17.006352664497122]
Modern deep neural networks often require distributed training with many workers due to their large size.
As the number of workers increases, communication overheads become the main bottleneck in data-parallel minibatch gradient methods with per-iteration gradient synchronization.
We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance.
arXiv Detail & Related papers (2024-06-20T02:08:50Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- RS-DGC: Exploring Neighborhood Statistics for Dynamic Gradient Compression on Remote Sensing Image Interpretation [23.649838489244917]
Gradient sparsification has been validated as an effective gradient compression (GC) technique for reducing communication costs.
We propose RS-DGC, a simple yet effective dynamic gradient compression scheme for remote sensing (RS) image interpretation that leverages a neighborhood statistics indicator.
We achieve an accuracy improvement of 0.51% with more than 50 times communication compression on the NWPU-RESISC45 dataset.
arXiv Detail & Related papers (2023-12-29T09:24:26Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- Efficient Distributed Auto-Differentiation [22.192220404846267]
Gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy.
We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient.
The process provides the flexibility of averaging gradients during backpropagation, enabling novel flexible training schemas.
arXiv Detail & Related papers (2021-02-18T21:46:27Z)
- DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning [79.89085533866071]
This paper introduces DeepReduce, a versatile framework for the compressed communication of sparse tensors.
DeepReduce decomposes tensors into two sets, values and indices, and allows both independent and combined compression of these sets.
Our experiments with large real models demonstrate that DeepReduce transmits fewer data and imposes lower computational overhead than existing methods.
arXiv Detail & Related papers (2021-02-05T11:31:24Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance. (A generic top-k sparsification sketch illustrating this style of compression appears after this list.)
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
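As a generic illustration of the pattern several entries above describe — a
gradient split into values and indices, as in DeepReduce, and workers' sparse
contributions averaged, as in the sparse-communication study above — the
following is a minimal, hypothetical top-k sketch in NumPy. The names
(topk_compress, aggregate) and parameters are assumptions for illustration and
do not reproduce any of the listed papers' implementations; error feedback,
quantization, and real communication are omitted.

import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries as (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def aggregate(compressed, dim, num_workers):
    """Average the workers' sparse contributions into a dense gradient."""
    out = np.zeros(dim)
    for idx, vals in compressed:
        np.add.at(out, idx, vals)   # scatter-add handles colliding indices
    return out / num_workers

# Toy usage: 4 workers, 1000-dimensional gradients, keep 1% of entries each.
rng = np.random.default_rng(0)
dim, k, workers = 1000, 10, 4
grads = [rng.normal(size=dim) for _ in range(workers)]
compressed = [topk_compress(g, k) for g in grads]
avg_sparse = aggregate(compressed, dim, workers)
print(np.count_nonzero(avg_sparse), "non-zero entries after aggregation")

Each worker here ships only k index/value pairs instead of the dense gradient;
the methods listed above differ mainly in how they pick, encode, and correct
these sparse contributions.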
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.