Improved Quantization Strategies for Managing Heavy-tailed Gradients in
Distributed Learning
- URL: http://arxiv.org/abs/2402.01798v1
- Date: Fri, 2 Feb 2024 06:14:31 GMT
- Title: Improved Quantization Strategies for Managing Heavy-tailed Gradients in
Distributed Learning
- Authors: Guangfeng Yan, Tan Li, Yuanzhang Xiao, Hanxu Hou and Linqi Song
- Abstract summary: It is observed that gradient distributions are heavy-tailed, with outliers significantly influencing the design of compression strategies.
Existing parameter quantization methods experience performance degradation when this heavy-tailed feature is ignored.
We introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines truncation with quantization.
- Score: 20.91559450517002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient compression has surfaced as a key technique to address the challenge
of communication efficiency in distributed learning. In distributed deep
learning, however, it is observed that gradient distributions are heavy-tailed,
with outliers significantly influencing the design of compression strategies.
Existing parameter quantization methods experience performance degradation when
this heavy-tailed feature is ignored. In this paper, we introduce a novel
compression scheme specifically engineered for heavy-tailed gradients, which
effectively combines gradient truncation with quantization. This scheme is
adeptly implemented within a communication-limited distributed Stochastic
Gradient Descent (SGD) framework. For a general family of heavy-tailed
gradients that follow a power-law distribution, we aim to minimize the error
resulting from quantization, thereby determining optimal values for two
critical parameters: the truncation threshold and the quantization density. We
provide a theoretical analysis on the convergence error bound under both
uniform and non-uniform quantization scenarios. Comparative experiments with
other benchmarks demonstrate the effectiveness of our proposed method in
managing the heavy-tailed gradients in a distributed learning environment.
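As a rough illustration of the truncate-then-quantize idea described in the abstract, the sketch below clips each gradient coordinate to a threshold and then applies stochastic uniform quantization. The threshold `tau` and the number of levels `s` are placeholder values, not the optimal parameters derived in the paper, and the quantizer shown is a generic uniform one rather than the paper's optimized design.

```python
import numpy as np

def truncate_and_quantize(grad, tau, s, rng=None):
    """Clip gradient entries to [-tau, tau], then apply stochastic uniform
    quantization with s levels per sign (illustrative sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.clip(grad, -tau, tau)            # truncation step
    scale = tau / s                         # width of one quantization bin
    level = np.abs(g) / scale               # real-valued level in [0, s]
    low = np.floor(level)
    round_up = rng.random(g.shape) < (level - low)   # unbiased stochastic rounding
    return np.sign(g) * (low + round_up) * scale

# Toy example with heavy-tailed (Student-t) gradient noise.
rng = np.random.default_rng(0)
grad = rng.standard_t(df=2, size=10_000)
q = truncate_and_quantize(grad, tau=3.0, s=4, rng=rng)
print("quantization MSE:", np.mean((q - grad) ** 2))
```

In a communication-limited SGD round, each worker would transmit the integer levels and signs (plus the scalar threshold) instead of full-precision gradients, and the server would dequantize before averaging.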
Related papers
- Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning [15.78336840511033]
This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios.
Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates.
We employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity.
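For intuition only, a generic way to pair sparsification with an explicitly encoded mask is to keep the top-k entries and transmit a packed bitmask of their positions alongside the values. This is a hedged sketch of that general idea, not the paper's mask-encoding or error-compensation scheme.

```python
import numpy as np

def topk_with_bitmask(x, k):
    """Keep the k largest-magnitude entries of x and encode their positions
    as a packed bitmask (illustrative only, not the paper's scheme)."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros(x.size, dtype=bool)
    mask[idx] = True
    packed = np.packbits(mask)          # 1 bit per position on the wire
    return packed, x[mask]              # bitmask + k float values

def unpack(packed, values, n):
    mask = np.unpackbits(packed, count=n).astype(bool)
    out = np.zeros(n, dtype=values.dtype)
    out[mask] = values
    return out

x = np.random.default_rng(1).standard_normal(64)
packed, vals = topk_with_bitmask(x, k=8)
x_hat = unpack(packed, vals, n=x.size)
```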
arXiv Detail & Related papers (2024-08-25T09:30:34Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
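A generic one-bit compressor (signs plus a single scaling factor) conveys the flavor of such schemes; this sketch is not the FO-SGD algorithm itself, whose flattening and variance-control ideas are not reproduced here.

```python
import numpy as np

def one_bit_compress(g):
    """Send only the signs plus one scalar, chosen so the reconstruction has
    roughly the same l1 norm as g (generic sign compression, not FO-SGD)."""
    scale = np.mean(np.abs(g))               # single float per vector
    signs = np.sign(g).astype(np.int8)       # 1 byte here, 1 bit with packing
    return scale, signs

def one_bit_decompress(scale, signs):
    return scale * signs.astype(np.float32)

g = np.random.default_rng(2).standard_normal(1000).astype(np.float32)
g_hat = one_bit_decompress(*one_bit_compress(g))
```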
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
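A plain server-side AdaGrad update over averaged client updates looks roughly as follows; this sketch assumes ideal, noise-free aggregation and does not model the over-the-air channel analyzed in the paper.

```python
import numpy as np

def fed_adagrad_step(w, client_deltas, accum, lr=0.1, eps=1e-8):
    """One round of server-side AdaGrad over the averaged client update
    (generic sketch; over-the-air aggregation noise is not modeled)."""
    delta = np.mean(client_deltas, axis=0)   # ideal aggregation of client updates
    accum = accum + delta ** 2               # accumulated squared pseudo-gradients
    w = w + lr * delta / (np.sqrt(accum) + eps)
    return w, accum

# Toy usage: 5 clients sending 10-dimensional model updates.
rng = np.random.default_rng(3)
w, accum = np.zeros(10), np.zeros(10)
for _ in range(3):
    deltas = [rng.standard_normal(10) * 0.01 for _ in range(5)]
    w, accum = fed_adagrad_step(w, deltas, accum)
```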
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Truncated Non-Uniform Quantization for Distributed SGD [17.30572818507568]
We introduce a novel two-stage quantization strategy to enhance the communication efficiency of distributed Stochastic Gradient Descent (SGD).
The proposed method initially employs truncation to mitigate the impact of long-tail noise, followed by a non-uniform quantization of the post-truncation gradients based on their statistical characteristics.
Our proposed algorithm outperforms existing quantization schemes, striking a superior balance between communication efficiency and convergence performance.
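To illustrate the non-uniform stage in isolation: after truncation, magnitudes can be stochastically rounded to a non-uniform grid that is denser near zero, as in the sketch below. The level placement here is hypothetical, not the statistically optimized design from the paper.

```python
import numpy as np

def nonuniform_quantize(g, tau, levels, rng=None):
    """Truncate to [-tau, tau], then stochastically round |g| to one of the
    given non-uniform levels (levels must span [0, tau]; illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    levels = np.asarray(levels, dtype=float)
    a = np.clip(np.abs(g), 0.0, tau)
    hi = np.clip(np.searchsorted(levels, a), 1, len(levels) - 1)
    lo = hi - 1
    p_up = (a - levels[lo]) / (levels[hi] - levels[lo])   # unbiased rounding
    q = np.where(rng.random(a.shape) < p_up, levels[hi], levels[lo])
    return np.sign(g) * q

g = np.random.default_rng(4).standard_t(df=2, size=1000)
levels = [0.0, 0.25, 0.75, 1.75, 3.0]   # hypothetical grid, denser near zero
print(nonuniform_quantize(g, tau=3.0, levels=levels)[:5])
```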
arXiv Detail & Related papers (2024-02-02T05:59:48Z)
- On Uniform Scalar Quantization for Learned Image Compression [17.24702997651976]
We find two factors crucial: the discrepancy between the surrogate and actual rounding, which leads to a train-test mismatch, and the gradient estimation risk introduced by the surrogate.
Our analyses point to two subtle tricks: one is to set an appropriate lower bound on the variance of the estimated quantized latent distribution, which effectively reduces the train-test mismatch.
Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
arXiv Detail & Related papers (2023-09-29T08:23:36Z)
- Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem has strong duality and avoids gradient estimation.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance to the case with no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- Wyner-Ziv Gradient Compression for Federated Learning [4.619828919345114]
Gradient compression is an effective method to reduce communication load by transmitting compressed gradients.
This paper proposes a practical gradient compression scheme for federated learning, which uses historical gradients to compress gradients.
We also evaluate our gradient quantization method on a real dataset, where it outperforms previous schemes.
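The role of historical gradients can be pictured with a simple differential sketch: only the innovation relative to the previously decoded gradient, which both worker and server already share, is quantized. The actual Wyner-Ziv construction relies on binning with side information rather than explicit differencing, so treat this purely as intuition.

```python
import numpy as np

def diff_compress(g, g_prev_decoded, s=8):
    """Quantize only the residual against a prediction known to both sides
    (hedged sketch; not the paper's Wyner-Ziv binning scheme)."""
    resid = g - g_prev_decoded
    scale = np.max(np.abs(resid)) / s + 1e-12
    q = np.round(resid / scale).astype(np.int32)   # small-range integers
    return q, scale

def diff_decompress(q, scale, g_prev_decoded):
    return g_prev_decoded + q * scale

rng = np.random.default_rng(5)
g_prev = rng.standard_normal(100)
g = g_prev + 0.1 * rng.standard_normal(100)        # gradients correlated over time
q, scale = diff_compress(g, g_prev)
g_hat = diff_decompress(q, scale, g_prev)
```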
arXiv Detail & Related papers (2021-11-16T07:55:43Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
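Threshold-based sparsification of this kind can be sketched by fitting a simple parametric model to the gradient magnitudes and solving for the threshold analytically instead of sorting. The single-stage exponential fit below is an assumption for illustration, not SIDCo's multi-stage estimator.

```python
import numpy as np

def sparsify_by_estimated_threshold(g, target_ratio=0.01):
    """Pick a threshold from a fitted exponential model of |g| so that roughly
    target_ratio of the entries survive (rough sketch, not SIDCo itself)."""
    lam = 1.0 / np.mean(np.abs(g))                 # MLE for an exponential fit
    thr = -np.log(target_ratio) / lam              # P(|g| > thr) ~ target_ratio
    mask = np.abs(g) > thr
    return np.flatnonzero(mask), g[mask]

g = np.random.default_rng(7).laplace(size=100_000)
idx, vals = sparsify_by_estimated_threshold(g, target_ratio=0.01)
print(len(idx) / g.size)   # close to 0.01 for Laplace-like gradients
```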
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Quantized Adam with Error Feedback [11.91306069500983]
We present a distributed variant of adaptive gradient method for training deep neural networks.
We incorporate two types of quantization schemes to reduce the communication cost among the workers.
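The error-feedback mechanism itself is easy to sketch: compress the gradient plus the accumulated compression error, transmit the compressed value, and carry the new error forward. The quantizer below is a generic stand-in, and the Adam state updates are omitted, so this is not the paper's exact scheme.

```python
import numpy as np

def ef_quantize(g, memory, s=4, rng=None):
    """Error-feedback wrapper around a simple stochastic quantizer: compress
    g + memory, then keep the compression error as the next round's memory
    (generic EF sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    v = g + memory
    scale = np.max(np.abs(v)) / s + 1e-12
    level = np.abs(v) / scale
    low = np.floor(level)
    q = np.sign(v) * (low + (rng.random(v.shape) < level - low)) * scale
    return q, v - q          # q is transmitted; (v - q) is the new memory

rng = np.random.default_rng(6)
memory = np.zeros(100)
for _ in range(3):
    g = rng.standard_normal(100)
    q, memory = ef_quantize(g, memory, rng=rng)
```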
arXiv Detail & Related papers (2020-04-29T13:21:54Z)