Quantization for Distributed Optimization
- URL: http://arxiv.org/abs/2109.12497v1
- Date: Sun, 26 Sep 2021 05:16:12 GMT
- Title: Quantization for Distributed Optimization
- Authors: Vineeth S
- Abstract summary: We present a set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massive amounts of data have made the training of large-scale machine
learning models on a single worker inefficient. Distributed machine learning
methods such as Parallel-SGD have received significant interest as a solution
to tackle this problem. However, the performance of distributed systems does
not scale linearly with the number of workers due to the high network
communication cost for synchronizing gradients and parameters. Researchers have
proposed techniques such as quantization and sparsification to alleviate this
problem by compressing the gradients. Most of the compression schemes result in
compressed gradients that cannot be directly aggregated with efficient
protocols such as all-reduce. In this paper, we present a set of all-reduce
compatible gradient compression schemes which significantly reduce the
communication overhead while maintaining the performance of vanilla SGD. We
present the results of our experiments with the CIFAR10 dataset and
observations derived during the process. Our compression methods perform better
than the in-built methods currently offered by the deep learning frameworks.
Code is available at the repository:
\url{https://github.com/vineeths96/Gradient-Compression}.
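The property the abstract emphasizes is all-reduce compatibility: encoded gradients from different workers must be summable without first decoding them. Below is a minimal sketch of that idea, assuming a simple stochastic uniform quantizer with a grid shared by all workers; it illustrates the general property, not the specific schemes implemented in the repository above, and the multi-worker setup is only simulated in a single process with NumPy.

```python
import numpy as np

def quantize(grad, scale, levels=256):
    """Stochastically round grad onto a uniform grid with step `scale`.

    Because every worker uses the same grid, the resulting integer codes
    can be summed directly (exactly what all-reduce does) and decoded
    only once after aggregation.
    """
    x = grad / scale
    low = np.floor(x)
    up_prob = x - low                               # chance of rounding up
    codes = low + (np.random.rand(*x.shape) < up_prob)
    return np.clip(codes, -levels // 2, levels // 2 - 1).astype(np.int32)

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

# Simulated gradients from 4 workers (in a real run these live on separate
# nodes and the summation below would be performed by an all-reduce primitive).
rng = np.random.default_rng(0)
worker_grads = [rng.standard_normal(1000).astype(np.float32) for _ in range(4)]

# A grid step shared by all workers. Here it is derived from the global
# max-norm for illustration; in a real system the workers would agree on it
# with a cheap all-reduce of scalars.
scale = max(np.abs(g).max() for g in worker_grads) / 127.0

summed_codes = sum(quantize(g, scale) for g in worker_grads)   # "all-reduce" step
averaged = dequantize(summed_codes, scale) / len(worker_grads)

exact = sum(worker_grads) / len(worker_grads)
print("mean abs error vs. exact average:", float(np.abs(averaged - exact).mean()))
```

Because the codes of all workers live on the same grid, summing them and rescaling once gives (up to rounding noise) the average gradient, which is what makes the scheme compatible with ring or tree all-reduce.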
Related papers
- LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality (see the error-feedback sketch after this list).
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z)
- Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression [13.255861297820326]
Gradient compression can reduce the volume of gradient data that must be communicated.
In practice, however, gradient compression schemes often do not accelerate training while also preserving accuracy.
We identify common issues in previous gradient compression systems and evaluation methodologies.
arXiv Detail & Related papers (2024-07-01T15:32:28Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also implement our gradient quantization method and evaluate it on a real dataset, where its performance is better than that of previous schemes.
arXiv Detail & Related papers (2021-11-16T07:55:43Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- On the Utility of Gradient Compression in Distributed Training Systems [9.017890174185872]
We evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD.
Surprisingly, we observe that due to computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative.
arXiv Detail & Related papers (2021-02-28T15:58:45Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
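Several of the entries above (for example, the gradient compensation in LoCo) revolve around error feedback: the part of the gradient discarded by compression is stored locally and added back before the next compression step, so the lost information is eventually transmitted rather than vanishing. The sketch below illustrates that generic mechanism around a top-k sparsifier; it is a minimal, hypothetical illustration and not the algorithm of any specific paper listed here.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

class ErrorFeedbackCompressor:
    """Generic error feedback: the residual left over from the previous
    compression step is added back to the current gradient before
    compressing, so information is delayed rather than discarded."""

    def __init__(self, size, k):
        self.residual = np.zeros(size, dtype=np.float32)
        self.k = k

    def compress(self, grad):
        corrected = grad + self.residual          # compensate with the past error
        compressed = top_k(corrected, self.k)     # any compressor could be plugged in
        self.residual = corrected - compressed    # remember what was dropped
        return compressed

# Toy usage: a single worker compressing a stream of gradients.
rng = np.random.default_rng(0)
compressor = ErrorFeedbackCompressor(size=1000, k=50)
for step in range(3):
    grad = rng.standard_normal(1000).astype(np.float32)
    sent = compressor.compress(grad)
    print(f"step {step}: kept {np.count_nonzero(sent)} of {grad.size} entries")
```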