Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression
- URL: http://arxiv.org/abs/2204.06787v1
- Date: Thu, 14 Apr 2022 06:54:32 GMT
- Title: Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression
- Authors: Feijie Wu, Shiqi He, Song Guo, Zhihao Qu, Haozhao Wang, Weihua Zhuang, Jie Zhang
- Abstract summary: We implement a sign-bit compression-based learning synchronization framework, Marsit.
It reduces training time by up to 35% while preserving the same accuracy as training without compression.
- Score: 17.692238652162203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, the cascading compression considerably deteriorates convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and a dedicated global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
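For intuition, here is a minimal single-aggregation sketch of sign-bit compression with an error-compensation buffer and a majority-vote combination of worker signs. This is not Marsit itself: the paper's bit-wise unbiased aggregation and its global compensation mechanism are specific to multi-hop all-reduce, so the function names, the majority vote, and the scaling factor below are illustrative assumptions.

```python
# Sketch only (not the authors' code): sign-bit compression with per-worker
# error compensation and a majority-vote aggregation of the transmitted signs.
import numpy as np

def sign_compress(grad, residual, scale):
    """Return +/-1 per coordinate after folding in the compensation residual."""
    corrected = grad + residual
    signs = np.where(corrected >= 0, 1.0, -1.0)
    residual = corrected - scale * signs       # carry the compression error forward
    return signs, residual

def majority_vote(all_signs):
    """Aggregate worker sign bits by majority (a simple stand-in for sign aggregation)."""
    return np.where(all_signs.sum(axis=0) >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
num_workers, dim, scale = 4, 8, 0.01           # assumed sizes and scaling factor
grads = rng.normal(size=(num_workers, dim))
residuals = np.zeros((num_workers, dim))
signs = np.empty((num_workers, dim))
for w in range(num_workers):                   # each worker sends 1 bit per coordinate
    signs[w], residuals[w] = sign_compress(grads[w], residuals[w], scale)
update = scale * majority_vote(signs)          # aggregated model update
print(update)
```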
Related papers
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback.
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
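The error-feedback entry above blends differential quantization with error feedback; the sketch below shows the generic pattern of quantizing the difference to the last transmitted state while feeding the quantization error into the next round. The uniform quantizer, step sizes, and function names are assumptions for illustration, not the paper's algorithm.

```python
# Generic differential quantization with error feedback (illustrative sketch).
import numpy as np

def uniform_quantize(x, step=0.05):
    """Toy uniform quantizer; the paper analyzes more general quantizers."""
    return step * np.round(x / step)

def encode_update(model, last_sent, residual):
    """Quantize the difference to the last transmitted state, with error feedback."""
    diff = model - last_sent + residual        # differential signal plus past error
    q = uniform_quantize(diff)
    residual = diff - q                        # remember what quantization lost
    last_sent = last_sent + q                  # the receiver tracks the same state
    return q, last_sent, residual

rng = np.random.default_rng(1)
model = rng.normal(size=5)
last_sent, residual = np.zeros(5), np.zeros(5)
for _ in range(3):
    model -= 0.1 * rng.normal(size=5)          # stand-in for a local SGD step
    q, last_sent, residual = encode_update(model, last_sent, residual)
    print(q)                                   # only the quantized difference is transmitted
```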
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
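The entry above evaluates TopK compression of activations and gradients; below is a generic TopK sparsification sketch (tensor shape, the value of k, and the function names are illustrative assumptions, not the paper's code).

```python
# Generic TopK sparsification: keep the largest-magnitude entries, drop the rest.
import numpy as np

def topk_compress(tensor, k):
    """Keep the k largest-magnitude entries; transmit only indices and values."""
    flat = tensor.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    """Scatter the kept entries back into a dense zero tensor."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)

grad = np.random.default_rng(2).normal(size=(4, 4))
idx, vals = topk_compress(grad, k=4)           # transmit 4 of 16 entries
print(topk_decompress(idx, vals, grad.shape))
```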
- Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization [5.648270790530862]
State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality (accuracy) and compression ratio.
We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels.
We propose a non-uniform quantization scheme that leverages this variation, an efficient search mechanism that dynamically finds the best quantization configurations, and a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences.
arXiv Detail & Related papers (2023-06-20T18:00:31Z)
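The Inshrinkerator entry above combines non-uniform quantization with quantization-aware delta compression of checkpoints; the toy sketch below quantizes two successive checkpoints on a shared grid and stores only the entries that changed. The 4-bit uniform grid and the helper names are assumptions for illustration, not the paper's scheme.

```python
# Toy checkpoint delta compression on a shared quantization grid (illustrative only;
# Inshrinkerator uses non-uniform, dynamically searched quantization configurations).
import numpy as np

def quantize(weights, lo, scale, levels=15):
    """Map weights onto a shared 4-bit grid so small drifts leave most codes unchanged."""
    return np.clip(np.round((weights - lo) / scale), 0, levels).astype(np.int16)

def checkpoint_delta(prev_q, curr_q):
    """Store only the quantized entries that differ from the previous checkpoint."""
    changed = np.flatnonzero(prev_q != curr_q)
    return changed, curr_q[changed]

rng = np.random.default_rng(3)
w0 = rng.normal(size=1000)
w1 = w0 + 0.001 * rng.normal(size=1000)        # small drift between checkpoints
lo, scale = w0.min(), (w0.max() - w0.min()) / 15
q0, q1 = quantize(w0, lo, scale), quantize(w1, lo, scale)
idx, vals = checkpoint_delta(q0, q1)
print(f"{len(idx)} of {q1.size} quantized weights changed between checkpoints")
```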
- Practical Network Acceleration with Tiny Sets [38.742142493108744]
Network compression is effective in accelerating the inference of deep neural networks.
However, it often requires fine-tuning with all the training data to recover from the accuracy loss.
We propose a method named PRACTISE to accelerate the network with tiny sets of training images.
arXiv Detail & Related papers (2022-02-16T05:04:38Z)
- Optimal Rate Adaption in Federated Learning with Compressed Communications [28.16239232265479]
Federated Learning incurs high communication overhead, which can be greatly alleviated by compression for model updates.
However, the tradeoff between compression and model accuracy in the networked environment remains unclear.
We present a framework to maximize the final model accuracy by strategically adjusting the compression level at each iteration.
arXiv Detail & Related papers (2021-12-13T14:26:15Z)
- Compressed Communication for Distributed Training: Adaptive Methods and System [13.244482588437972]
Communication overhead severely hinders the scalability of distributed machine learning systems.
Recently, there has been a growing interest in using gradient compression to reduce the communication overhead.
In this paper, we first introduce a novel adaptive gradient method with gradient compression.
arXiv Detail & Related papers (2021-05-17T13:41:47Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
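For intuition on the SIDCo entry's threshold-based sparsification, the sketch below fits gradient magnitudes with an exponential distribution and derives a threshold for a target sparsity ratio. The exponential fit, the target ratio, and the function name are assumptions; SIDCo's actual estimator is more involved.

```python
# Illustrative threshold estimation for sparsification via an exponential fit
# (an assumed sparsity-inducing distribution; SIDCo's estimator differs in detail).
import numpy as np

def exp_threshold(grad, target_ratio):
    """Pick a magnitude threshold so roughly target_ratio of entries survive:
    for |g| ~ Exp(1/mean), P(|g| > t) = exp(-t / mean), so t = -mean * ln(ratio)."""
    mean_abs = np.abs(grad).mean()
    return -mean_abs * np.log(target_ratio)

grad = np.random.default_rng(4).laplace(size=100_000)   # toy gradient with heavier tails
t = exp_threshold(grad, target_ratio=0.001)             # aim to keep ~0.1% of entries
kept = np.abs(grad) > t
print(f"kept {kept.mean():.4%} of entries with threshold {t:.3f}")
```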
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
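The PowerGossip entry above compresses model differences between neighboring workers with power iteration; below is a rank-1 power-step sketch in that spirit. The matrix shapes, the single rank, and the reuse of the right factor across rounds are illustrative assumptions, not the paper's exact algorithm.

```python
# Rank-1 power-iteration compression of a model difference (illustrative sketch).
import numpy as np

def rank1_power_step(diff, q):
    """One power step: returns factors p, q with diff approximately outer(p, q)."""
    p = diff @ q
    p /= np.linalg.norm(p) + 1e-12             # normalize the left factor
    q = diff.T @ p                             # right factor carries the magnitude
    return p, q

rng = np.random.default_rng(5)
# A nearly rank-1 difference between two neighbors' models (toy data).
diff = np.outer(rng.normal(size=64), rng.normal(size=32)) + 0.05 * rng.normal(size=(64, 32))
q = rng.normal(size=32)
q /= np.linalg.norm(q)                         # warm-started and reused across rounds
for _ in range(3):
    p, q = rank1_power_step(diff, q)
approx = np.outer(p, q)                        # receiver reconstructs from 64 + 32 numbers
print(np.linalg.norm(diff - approx) / np.linalg.norm(diff))
```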
- Structured Sparsification with Joint Optimization of Group Convolution and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)