ScaleCom: Scalable Sparsified Gradient Compression for
Communication-Efficient Distributed Training
- URL: http://arxiv.org/abs/2104.11125v1
- Date: Wed, 21 Apr 2021 02:22:10 GMT
- Title: ScaleCom: Scalable Sparsified Gradient Compression for
Communication-Efficient Distributed Training
- Authors: Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao
Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei
Zhang, Kailash Gopalakrishnan
- Abstract summary: Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
- Score: 74.43625662170284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale distributed training of Deep Neural Networks (DNNs) on
state-of-the-art platforms is expected to be severely communication
constrained. To overcome this limitation, numerous gradient compression
techniques have been proposed and have demonstrated high compression ratios.
However, most existing methods do not scale well to large scale distributed
systems (due to gradient build-up) and/or fail to evaluate model fidelity (test
accuracy) on large datasets. To mitigate these issues, we propose a new
compression technique, Scalable Sparsified Gradient Compression (ScaleCom),
that leverages similarity in the gradient distribution amongst learners to
provide significantly improved scalability. Using theoretical analysis, we show
that ScaleCom provides favorable convergence guarantees and is compatible with
gradient all-reduce techniques. Furthermore, we experimentally demonstrate that
ScaleCom has small overheads, directly reduces gradient traffic and provides
high compression rates (65-400X) and excellent scalability (up to 64 learners
and 8-12X larger batch sizes over standard training) across a wide range of
applications (image, language, and speech) without significant accuracy loss.
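One way to read the scalability claim: if every learner sparsifies on a common index set chosen from a single learner's local top-k (rotating that learner cyclically across steps), the sparse gradients can be summed directly by all-reduce and the reduced result stays k-sparse, so traffic does not grow with the number of learners. The NumPy sketch below simulates that cyclic local top-k idea in-process; it is a simplified illustration under our own assumptions, not necessarily the paper's exact compressor, and a full implementation would also keep per-learner error-feedback residuals for the coordinates it drops.

    import numpy as np

    def cyclic_local_topk_allreduce(grads, k, leader):
        """Sparsify every learner's gradient on the index set chosen by one
        leader learner, then sum the sparse gradients (simulated all-reduce).

        Because all learners share the leader's index set, the reduced result
        stays k-sparse and traffic does not grow with the number of learners
        (no "gradient build-up")."""
        # Leader picks the k largest-magnitude coordinates of *its own* gradient.
        idx = np.argsort(np.abs(grads[leader]))[-k:]
        # Every learner contributes only the values at those shared indices.
        reduced = np.zeros_like(grads[0])
        for g in grads:
            reduced[idx] += g[idx]
        return reduced / len(grads), idx

    # Toy simulation: 4 learners with similar gradients (shared signal + noise),
    # which is the cross-learner similarity the abstract refers to.
    rng = np.random.default_rng(0)
    signal = rng.normal(size=1000)
    grads = [signal + 0.1 * rng.normal(size=1000) for _ in range(4)]

    step = 7                      # training step counter
    leader = step % len(grads)    # rotate the leader cyclically
    avg_sparse, idx = cyclic_local_topk_allreduce(grads, k=10, leader=leader)
    print("non-zeros in reduced gradient:", np.count_nonzero(avg_sparse))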
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks such as Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
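The compensation step above is commonly realized as error feedback: the compression error left over from the previous step is added back to the current gradient before it is compressed again. The sketch below is an illustrative NumPy version of this compensate-then-compress pattern under assumed choices (a 4-bit symmetric uniform quantizer and a single residual buffer); it is not LoCo's exact algorithm.

    import numpy as np

    def quantize_uniform(x, bits=4):
        """Symmetric uniform quantizer to `bits` bits (illustrative low-bit codec)."""
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / levels + 1e-12
        q = np.round(x / scale).clip(-levels, levels)
        return q * scale  # dequantized value that would be communicated

    class CompensatedCompressor:
        """Error-feedback wrapper: add the residual left over from the previous
        step to the current gradient before compressing it."""
        def __init__(self, shape, bits=4):
            self.residual = np.zeros(shape)
            self.bits = bits

        def compress(self, grad):
            compensated = grad + self.residual          # add back what was lost
            compressed = quantize_uniform(compensated, self.bits)
            self.residual = compensated - compressed    # remember the new loss
            return compressed

    rng = np.random.default_rng(0)
    comp = CompensatedCompressor(shape=(1000,), bits=4)
    for _ in range(3):
        g = rng.normal(size=1000)
        g_hat = comp.compress(g)   # g_hat is what would be sent over the wire
    print("residual norm:", np.linalg.norm(comp.residual))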
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression [13.255861297820326]
Gradient compression can reduce the volume of gradient data communicated during training.
In practice, however, many gradient compression schemes fail to accelerate training while also preserving accuracy.
We identify common issues in previous gradient compression systems and evaluation methodologies.
arXiv Detail & Related papers (2024-07-01T15:32:28Z) - Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning that uses historical gradients as side information to compress current gradients.
We also evaluate our gradient quantization method on real datasets, where it outperforms previous schemes.
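As a loose illustration of compressing with the help of historical gradients, the sketch below predicts the current gradient from a running average of past gradients and quantizes only the prediction error, which has a smaller dynamic range than the gradient itself. This is plain predictive quantization, not the paper's Wyner-Ziv construction (where the side information is available only at the decoder); the predictor, quantizer, and momentum constant are assumptions for illustration.

    import numpy as np

    def quantize(x, bits=4):
        levels = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / levels + 1e-12
        return np.round(x / scale).clip(-levels, levels) * scale

    rng = np.random.default_rng(0)
    history = np.zeros(1000)   # side information: running average of past gradients
    beta = 0.9

    for step in range(5):
        grad = history * 0.95 + rng.normal(scale=0.1, size=1000)  # correlated stream
        prediction = history                       # known from the shared history
        innovation = grad - prediction             # smaller range than grad itself
        decoded = prediction + quantize(innovation, bits=4)
        history = beta * history + (1 - beta) * decoded
        err = np.linalg.norm(decoded - grad) / (np.linalg.norm(grad) + 1e-12)
        print(f"step {step}: relative reconstruction error {err:.3f}")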
arXiv Detail & Related papers (2021-11-16T07:55:43Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce compatible gradient compression schemes that significantly reduce communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Compressed Communication for Distributed Training: Adaptive Methods and
System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
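One cheap way to obtain such a sparsification threshold is to fit a simple sparsity-inducing distribution to the gradient magnitudes and invert its tail for a target density. The exponential fit below is a minimal sketch of that idea, not SIDCo's full multi-stage estimator; the toy Laplace-distributed gradient and target ratio are assumptions.

    import numpy as np

    def exponential_threshold(grad, target_density):
        """Pick a threshold by fitting an exponential distribution to |grad|
        and solving P(|g| > t) = target_density for t."""
        scale = np.mean(np.abs(grad))            # MLE of the exponential scale
        return -scale * np.log(target_density)   # inverse of the tail exp(-t/scale)

    rng = np.random.default_rng(0)
    grad = rng.laplace(scale=0.01, size=1_000_000)   # heavy-tailed toy gradient

    target = 0.001                                   # keep ~0.1% of coordinates
    t = exponential_threshold(grad, target)
    mask = np.abs(grad) > t
    print(f"threshold {t:.4f} keeps {mask.mean():.4%} of elements (target {target:.2%})")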
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
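The NumPy sketch below shows a single rank-r power-iteration round on one model-difference matrix, in the spirit of the summary above: only the two thin factors would be communicated instead of the full matrix. The rank, the QR orthogonalization, and the warm-started right factor are illustrative choices rather than the paper's exact procedure; a real implementation reuses the factors across gossip rounds.

    import numpy as np

    def power_iteration_compress(diff, q):
        """One power-iteration round: approximate `diff` (m x n) with rank-r
        factors p (m x r) and q (n x r). Only p and the refreshed q are sent."""
        p = diff @ q                       # (m x r) left factor
        p, _ = np.linalg.qr(p)             # orthonormalize for a stable basis
        q = diff.T @ p                     # (n x r) refreshed right factor
        return p, q                        # low-rank approximation is p @ q.T

    rng = np.random.default_rng(0)
    m, n, r = 512, 256, 2
    diff = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # exactly rank-r here
    q = rng.normal(size=(n, r))                               # warm-start factor

    p, q = power_iteration_compress(diff, q)
    approx = p @ q.T
    print("relative error:", np.linalg.norm(diff - approx) / np.linalg.norm(diff))
    print("compression ratio:", diff.size / (p.size + q.size))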