Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression
- URL: http://arxiv.org/abs/2407.01378v2
- Date: Tue, 29 Oct 2024 11:24:30 GMT
- Title: Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression
- Authors: Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Brad Karp, Ran Ben Basat,
- Abstract summary: gradient compression can reduce communicated gradient data volume.
In practice, gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy.
We identify common issues in previous gradient compression systems and evaluation methodologies.
- Score: 13.255861297820326
- License:
- Abstract: Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify common issues in previous gradient compression systems and evaluation methodologies. These include excessive computational overheads; incompatibility with all-reduce; and insufficient evaluation methods, such as not using an end-to-end metric or using a 32-bit baseline instead of the stronger 16-bit baseline. We revisit common compression approaches (sparsification, quantization, and low-rank decomposition) and demonstrate how considering the above issues can lead to minor but strategic design changes, resulting in notably better performance. Our goal is to raise awareness of the need for design and evaluation standards that naturally translate to the end-to-end utility of gradient compression.
Related papers
- Unrolled Compressed Blind-Deconvolution [77.88847247301682]
sparse multichannel blind deconvolution (S-MBD) arises frequently in many engineering applications such as radar/sonar/ultrasound imaging.
We propose a compression method that enables blind recovery from much fewer measurements with respect to the full received signal in time.
arXiv Detail & Related papers (2022-09-28T15:16:58Z) - Communication-Efficient Federated Learning via Quantized Compressed
Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z) - Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also implement our gradient quantization method on the real dataset, and the performance of our method is better than the previous schemes.
arXiv Detail & Related papers (2021-11-16T07:55:43Z) - Quantization for Distributed Optimization [0.0]
We present a set of all-reduce gradient compatible compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z) - Compressed Communication for Distributed Training: Adaptive Methods and
System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z) - ScaleCom: Scalable Sparsified Gradient Compression for
Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC)
Our evaluation shows SIDCo speeds up training by up to 41:7%, 7:6%, and 1:9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - Analyzing and Mitigating JPEG Compression Defects in Deep Learning [69.04777875711646]
We present a unified study of the effects of JPEG compression on a range of common tasks and datasets.
We show that there is a significant penalty on common performance metrics for high compression.
arXiv Detail & Related papers (2020-11-17T20:32:57Z) - Accordion: Adaptive Gradient Communication via Critical Learning Regime
Identification [12.517161466778655]
Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes.
To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates.
In this work, we show that such performance degradation due to choosing a high compression ratio is not fundamental.
An adaptive compression strategy can reduce communication while maintaining final test accuracy.
arXiv Detail & Related papers (2020-10-29T16:41:44Z) - Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.