ScaleCom: Scalable Sparsified Gradient Compression for
Communication-Efficient Distributed Training
- URL: http://arxiv.org/abs/2104.11125v1
- Date: Wed, 21 Apr 2021 02:22:10 GMT
- Title: ScaleCom: Scalable Sparsified Gradient Compression for
Communication-Efficient Distributed Training
- Authors: Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao
Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei
Zhang, Kailash Gopalakrishnan
- Abstract summary: Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
- Score: 74.43625662170284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale distributed training of Deep Neural Networks (DNNs) on
state-of-the-art platforms is expected to be severely communication
constrained. To overcome this limitation, numerous gradient compression
techniques have been proposed and have demonstrated high compression ratios.
However, most existing methods do not scale well to large scale distributed
systems (due to gradient build-up) and/or fail to evaluate model fidelity (test
accuracy) on large datasets. To mitigate these issues, we propose a new
compression technique, Scalable Sparsified Gradient Compression (ScaleCom),
that leverages similarity in the gradient distribution amongst learners to
provide significantly improved scalability. Using theoretical analysis, we show
that ScaleCom provides favorable convergence guarantees and is compatible with
gradient all-reduce techniques. Furthermore, we experimentally demonstrate that
ScaleCom has small overheads, directly reduces gradient traffic and provides
high compression rates (65-400X) and excellent scalability (up to 64 learners
and 8-12X larger batch sizes over standard training) across a wide range of
applications (image, language, and speech) without significant accuracy loss.
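One way to read the scalability claim: if every learner sparsifies on a common index set chosen from a single learner's local top-k (rotating that learner cyclically across steps), the sparse gradients can be summed directly by all-reduce and the reduced result stays k-sparse, so traffic does not grow with the number of learners. The NumPy sketch below simulates that cyclic local top-k idea in-process; it is a simplified illustration under our own assumptions, not necessarily the paper's exact compressor, and a full implementation would also keep per-learner error-feedback residuals for the coordinates it drops.

    import numpy as np

    def cyclic_local_topk_allreduce(grads, k, leader):
        """Sparsify every learner's gradient on the index set chosen by one
        leader learner, then sum the sparse gradients (simulated all-reduce).

        Because all learners share the leader's index set, the reduced result
        stays k-sparse and traffic does not grow with the number of learners
        (no "gradient build-up")."""
        # Leader picks the k largest-magnitude coordinates of *its own* gradient.
        idx = np.argsort(np.abs(grads[leader]))[-k:]
        # Every learner contributes only the values at those shared indices.
        reduced = np.zeros_like(grads[0])
        for g in grads:
            reduced[idx] += g[idx]
        return reduced / len(grads), idx

    # Toy simulation: 4 learners with similar gradients (shared signal + noise),
    # which is the cross-learner similarity the abstract refers to.
    rng = np.random.default_rng(0)
    signal = rng.normal(size=1000)
    grads = [signal + 0.1 * rng.normal(size=1000) for _ in range(4)]

    step = 7                      # training step counter
    leader = step % len(grads)    # rotate the leader cyclically
    avg_sparse, idx = cyclic_local_topk_allreduce(grads, k=10, leader=leader)
    print("non-zeros in reduced gradient:", np.count_nonzero(avg_sparse))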
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks such as Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
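The compensation step above is commonly realized as error feedback: the compression error left over from the previous step is added back to the current gradient before it is compressed again. The sketch below is an illustrative NumPy version of this compensate-then-compress pattern under assumed choices (a 4-bit symmetric uniform quantizer and a single residual buffer); it is not LoCo's exact algorithm.

    import numpy as np

    def quantize_uniform(x, bits=4):
        """Symmetric uniform quantizer to `bits` bits (illustrative low-bit codec)."""
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / levels + 1e-12
        q = np.round(x / scale).clip(-levels, levels)
        return q * scale  # dequantized value that would be communicated

    class CompensatedCompressor:
        """Error-feedback wrapper: add the residual left over from the previous
        step to the current gradient before compressing it."""
        def __init__(self, shape, bits=4):
            self.residual = np.zeros(shape)
            self.bits = bits

        def compress(self, grad):
            compensated = grad + self.residual          # add back what was lost
            compressed = quantize_uniform(compensated, self.bits)
            self.residual = compensated - compressed    # remember the new loss
            return compressed

    rng = np.random.default_rng(0)
    comp = CompensatedCompressor(shape=(1000,), bits=4)
    for _ in range(3):
        g = rng.normal(size=1000)
        g_hat = comp.compress(g)   # g_hat is what would be sent over the wire
    print("residual norm:", np.linalg.norm(comp.residual))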
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression [13.255861297820326]
Gradient compression can reduce the volume of gradient data communicated during training.
In practice, however, many gradient compression schemes fail to accelerate training while also preserving accuracy.
We identify common issues in previous gradient compression systems and evaluation methodologies.
arXiv Detail & Related papers (2024-07-01T15:32:28Z) - Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning that uses historical gradients as side information to compress current gradients.
We also evaluate our gradient quantization method on real datasets, where it outperforms previous schemes.
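As a loose illustration of compressing with the help of historical gradients, the sketch below predicts the current gradient from a running average of past gradients and quantizes only the prediction error, which has a smaller dynamic range than the gradient itself. This is plain predictive quantization, not the paper's Wyner-Ziv construction (where the side information is available only at the decoder); the predictor, quantizer, and momentum constant are assumptions for illustration.

    import numpy as np

    def quantize(x, bits=4):
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / levels + 1e-12
        return np.round(x / scale).clip(-levels, levels) * scale

    rng = np.random.default_rng(0)
    history = np.zeros(1000)   # side information: running average of past gradients
    beta = 0.9

    for step in range(5):
        grad = history * 0.95 + rng.normal(scale=0.1, size=1000)  # correlated stream
        prediction = history                       # known from the shared history
        innovation = grad - prediction             # smaller range than grad itself
        decoded = prediction + quantize(innovation, bits=4)
        history = beta * history + (1 - beta) * decoded
        err = np.linalg.norm(decoded - grad) / (np.linalg.norm(grad) + 1e-12)
        print(f"step {step}: relative reconstruction error {err:.3f}")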
arXiv Detail & Related papers (2021-11-16T07:55:43Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Compressed Communication for Distributed Training: Adaptive Methods and
System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
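One cheap way to obtain such a sparsification threshold is to fit a simple sparsity-inducing distribution to the gradient magnitudes and invert its tail for a target density. The exponential fit below is a minimal sketch of that idea, not SIDCo's full multi-stage estimator; the toy Laplace-distributed gradient and target ratio are assumptions.

    import numpy as np

    def exponential_threshold(grad, target_density):
        """Pick a threshold by fitting an exponential distribution to |grad|
        and solving P(|g| > t) = target_density for t."""
        scale = np.mean(np.abs(grad))            # MLE of the exponential scale
        return -scale * np.log(target_density)   # inverse of the tail exp(-t/scale)

    rng = np.random.default_rng(0)
    grad = rng.laplace(scale=0.01, size=1_000_000)   # heavy-tailed toy gradient

    target = 0.001                                   # keep ~0.1% of coordinates
    t = exponential_threshold(grad, target)
    mask = np.abs(grad) > t
    print(f"threshold {t:.4f} keeps {mask.mean():.4%} of elements (target {target:.2%})")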
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
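The NumPy sketch below shows a single rank-r power-iteration round on one model-difference matrix, in the spirit of the summary above: only the two thin factors would be communicated instead of the full matrix. The rank, the QR orthogonalization, and the warm-started right factor are illustrative choices rather than the paper's exact procedure; a real implementation reuses the factors across gossip rounds.

    import numpy as np

    def power_iteration_compress(diff, q):
        """One power-iteration round: approximate `diff` (m x n) with rank-r
        factors p (m x r) and q (n x r). Only p and the refreshed q are sent."""
        p = diff @ q                       # (m x r) left factor
        p, _ = np.linalg.qr(p)             # orthonormalize for a stable basis
        q = diff.T @ p                     # (n x r) refreshed right factor
        return p, q                        # low-rank approximation is p @ q.T

    rng = np.random.default_rng(0)
    m, n, r = 512, 256, 2
    diff = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # exactly rank-r here
    q = rng.normal(size=(n, r))                               # warm-start factor

    p, q = power_iteration_compress(diff, q)
    approx = p @ q.T
    print("relative error:", np.linalg.norm(diff - approx) / np.linalg.norm(diff))
    print("compression ratio:", diff.size / (p.size + q.size))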