Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques
- URL: http://arxiv.org/abs/2502.07634v1
- Date: Sat, 07 Dec 2024 22:55:55 GMT
- Title: Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques
- Authors: Shruti Singh, Shantanu Kumar
- Abstract summary: Using top-k and DGC at 50 times compression yields performance improvements, reducing perplexity by up to 0.06 compared to baseline. Communication times are reduced across all compression methods, with top-k and DGC decreasing communication to negligible levels at high compression ratios.
- Score: 3.6481248057068174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in severe performance degradation, highlighting its inefficacy. In contrast, using top-k and DGC at 50 times compression yields performance improvements, reducing perplexity by up to 0.06 compared to baseline. Experiments across 1, 2, and 4 workers demonstrate that conservative sparsification can have a regularizing effect, especially for smaller models, while compression ratios above 5000 times impair performance, particularly for DGC. Communication times are reduced across all compression methods, with top-k and DGC decreasing communication to negligible levels at high compression ratios. However, increased computation times offset this efficiency for top-k due to sorting demands, making it less scalable than DGC or QSGD. In convergence tests, sparsification techniques show accelerated convergence, requiring fewer epochs than the baseline, which has implications for computational savings. Although precision trade-offs emerge, floating point errors are mitigated by compression. This study's findings underscore the need to tune hyperparameters specifically for each compression technique to achieve optimal model performance, especially in distributed training systems.
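The compressors named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the dictionary representation of a sparse gradient, and the `levels` parameter are assumptions made for illustration. The sketch covers top-k and random-k sparsification at a given compression ratio and a QSGD-style stochastic quantizer; the additional machinery DGC layers on top of top-k selection (such as accumulating the dropped entries locally) is omitted.

```python
import math
import random


def top_k_sparsify(grad, ratio):
    """Keep the len(grad) // ratio largest-magnitude entries as {index: value}."""
    k = max(1, len(grad) // ratio)
    # Ranking every entry by magnitude is the sorting cost that the
    # abstract cites as the reason top-k scales less well.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def random_k_sparsify(grad, ratio, rng):
    """Keep len(grad) // ratio entries chosen uniformly at random."""
    k = max(1, len(grad) // ratio)
    return {i: grad[i] for i in rng.sample(range(len(grad)), k)}


def qsgd_quantize(grad, levels, rng):
    """Stochastically round |g_i| / ||g|| onto `levels` uniform steps.

    Returns the L2 norm and one signed integer level per entry. Rounding up
    with probability equal to the fractional part keeps the estimate unbiased.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return norm, [0] * len(grad)
    quantized = []
    for g in grad:
        scaled = abs(g) / norm * levels
        lower = math.floor(scaled)
        level = lower + (1 if rng.random() < scaled - lower else 0)
        quantized.append(level if g >= 0 else -level)
    return norm, quantized


def qsgd_dequantize(norm, quantized, levels):
    """Map signed integer levels back to floating-point values."""
    return [norm * q / levels for q in quantized]


def retained_energy(sparse, grad):
    """Fraction of the squared gradient norm that survives sparsification."""
    total = sum(g * g for g in grad)
    return sum(v * v for v in sparse.values()) / total


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    top = top_k_sparsify(grad, ratio=50)
    rnd = random_k_sparsify(grad, ratio=50, rng=rng)
    print("top-k energy kept:   ", retained_energy(top, grad))
    print("random-k energy kept:", retained_energy(rnd, grad))
```

At the same compression ratio, top-k by construction keeps at least as much of the gradient's squared norm as random-k. That fits the abstract's observation that random-k degrades performance severely, although the sketch says nothing about perplexity or convergence.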
Related papers
- LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss. We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method with several appealing properties that prior art does not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning [24.712888488317816]
We provide a framework for adapting the degree of compression across the model's layers dynamically during training.
Our framework, called L-GreCo, is based on an adaptive algorithm, which automatically picks the optimal compression parameters for model layers.
arXiv Detail & Related papers (2022-10-31T14:37:41Z)
- Quantization for Distributed Optimization [0.0]
We present a set of gradient compression schemes compatible with all-reduce, which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Our compression methods perform better than the in-built methods currently offered by the deep learning frameworks.
arXiv Detail & Related papers (2021-09-26T05:16:12Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training [74.43625662170284]
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
We propose a new compression technique that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability.
We experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) without significant accuracy loss.
arXiv Detail & Related papers (2021-04-21T02:22:10Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
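
The threshold-based sparsification mentioned in the last entry can be sketched as follows. This is a hedged illustration and not the SIDCo algorithm itself: it assumes gradient magnitudes are roughly exponentially distributed, and the function names and the `target_fraction` parameter are hypothetical. Under that assumption a threshold that keeps a chosen fraction of entries follows in closed form from the mean magnitude, so no sort is needed.

```python
import math
import random


def exponential_threshold(grad, target_fraction):
    """Estimate eta such that P(|g| > eta) is about target_fraction.

    Assumes |g| ~ Exponential with mean m, so P(|g| > eta) = exp(-eta / m)
    and eta = -m * ln(target_fraction). One linear pass, no sorting.
    """
    mean_abs = sum(abs(g) for g in grad) / len(grad)
    return -mean_abs * math.log(target_fraction)


def threshold_sparsify(grad, eta):
    """Keep only entries whose magnitude exceeds eta, as {index: value}."""
    return {i: g for i, g in enumerate(grad) if abs(g) > eta}


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic gradient with exponentially distributed magnitudes.
    grad = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(20000)]
    eta = exponential_threshold(grad, target_fraction=0.01)
    kept = threshold_sparsify(grad, eta)
    print("threshold:", eta, "fraction kept:", len(kept) / len(grad))
```

If real gradients deviate from the assumed distribution, the fraction kept drifts from the target, so the quality of the fit decides how closely such a scheme can match sort-based selection.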