Quantized Adam with Error Feedback
- URL: http://arxiv.org/abs/2004.14180v2
- Date: Tue, 15 Jun 2021 04:41:03 GMT
- Title: Quantized Adam with Error Feedback
- Authors: Congliang Chen, Li Shen, Haozhi Huang, and Wei Liu
- Abstract summary: We present a distributed variant of an adaptive gradient method for training deep neural networks.
We incorporate two types of quantization schemes to reduce the communication cost among the workers.
- Score: 11.91306069500983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a distributed variant of an adaptive stochastic
gradient method for training deep neural networks in the parameter-server
model. To reduce the communication cost among the workers and server, we
incorporate two types of quantization schemes, i.e., gradient quantization and
weight quantization, into the proposed distributed Adam. In addition, to reduce the
bias introduced by quantization operations, we propose an error-feedback
technique to compensate for the quantized gradient. Theoretically, in the
stochastic nonconvex setting, we show that the distributed adaptive gradient
method with gradient quantization and error-feedback converges to the
first-order stationary point, and that the distributed adaptive gradient method
with weight quantization and error-feedback converges to a point related to the
quantization level under both single-worker and multi-worker modes. Finally, we
apply the proposed distributed adaptive gradient methods to train deep
neural networks. Experimental results demonstrate the efficacy of our methods.
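The recipe in the abstract (compress the gradient, keep the compression residual in an error buffer, add the buffer back before the next compression, and feed the compressed gradient into Adam) can be made concrete with a short sketch. The single-worker NumPy code below is only an illustration of that general error-feedback pattern under an assumed stochastic uniform quantizer; the function names, quantizer, and hyperparameters are not taken from the paper.

```python
import numpy as np

def stochastic_uniform_quantize(v, num_levels=16, rng=None):
    """Unbiased stochastic uniform quantizer (an illustrative choice of scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(v))
    if scale == 0.0:
        return np.zeros_like(v)
    levels = np.abs(v) / scale * (num_levels - 1)                # map magnitudes to [0, L-1]
    lower = np.floor(levels)
    rounded = lower + (rng.random(v.shape) < (levels - lower))   # round up w.p. fractional part
    return np.sign(v) * rounded / (num_levels - 1) * scale       # dequantized value

def quantized_adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam step on a quantized gradient with error feedback (sketch only)."""
    m, v, err, t = state["m"], state["v"], state["err"], state["t"] + 1

    compensated = grad + err                            # add back the past compression residual
    q_grad = stochastic_uniform_quantize(compensated)   # what a worker would transmit
    err = compensated - q_grad                          # residual kept locally for the next step

    # Standard Adam moment updates computed on the quantized, compensated gradient.
    m = betas[0] * m + (1 - betas[0]) * q_grad
    v = betas[1] * v + (1 - betas[1]) * q_grad ** 2
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

    return w, {"m": m, "v": v, "err": err, "t": t}

# Illustrative usage for a single parameter vector:
# state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "err": np.zeros_like(w), "t": 0}
# w, state = quantized_adam_step(w, grad, state)
```

In the parameter-server setting described in the abstract, each worker would keep its own error buffer and transmit only the quantized gradient, while the server aggregates these messages and applies the Adam update; weight quantization on the server-to-worker link can be treated analogously with a second error accumulator on the weights.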
Related papers
- Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients [24.973203825917906]
We show that lowering the error for large-magnitude gradients boosts the quantization performance significantly.
We also introduce an interval update algorithm that adjusts the quantization interval adaptively to maintain a small quantization error for large gradients.
arXiv Detail & Related papers (2024-07-17T15:06:12Z)
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning [20.91559450517002]
It is observed that gradient distributions are heavy-tailed, with outliers significantly influencing the design of compression strategies.
Existing parameter quantization methods experience performance degradation when this heavy-tailed feature is ignored.
We introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines truncation with quantization (see the sketch after this list).
arXiv Detail & Related papers (2024-02-02T06:14:31Z)
- On Uniform Scalar Quantization for Learned Image Compression [17.24702997651976]
We find two factors crucial: the discrepancy between the surrogate and rounding, leading to train-test mismatch, and gradient estimation risk due to the surrogate.
Our analyses point to two subtle tricks: one is to set an appropriate lower bound for the variance of the estimated quantized latent distribution, which effectively reduces the train-test mismatch.
Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
arXiv Detail & Related papers (2023-09-29T08:23:36Z)
- Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Decomposition of convolutional and fully-connected layers is an effective way to reduce parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z)
- Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling.
Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective.
This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
- Optimal Gradient Quantization Condition for Communication-Efficient Distributed Training [99.42912552638168]
Communication of gradients is costly for training deep neural networks with multiple devices in computer vision applications.
In this work, we deduce the optimal condition of both the binary and multi-level gradient quantization for any gradient distribution.
Based on the optimal condition, we develop two novel quantization schemes: biased BinGrad and unbiased ORQ for binary and multi-level gradient quantization respectively.
arXiv Detail & Related papers (2020-02-25T18:28:39Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
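The heavy-tailed-gradient entry above (2024-02-02) combines truncation with quantization: clip each gradient coordinate at a threshold so that rare outliers do not inflate the quantization range, then quantize the clipped vector. Below is a minimal sketch of that general idea, assuming a placeholder threshold rule and a stochastic uniform quantizer rather than the scheme actually proposed in that paper.

```python
import numpy as np

def truncate_then_quantize(grad, tau=None, num_levels=16, rng=None):
    """Clip heavy-tailed gradient entries, then stochastically quantize the result."""
    rng = np.random.default_rng() if rng is None else rng
    if tau is None:
        tau = 3.0 * np.std(grad)            # placeholder threshold rule (an assumption)
    clipped = np.clip(grad, -tau, tau)      # truncation tames outliers so they do not
                                            # inflate the quantization range
    scale = np.max(np.abs(clipped))
    if scale == 0.0:
        return np.zeros_like(grad)
    levels = np.abs(clipped) / scale * (num_levels - 1)
    lower = np.floor(levels)
    rounded = lower + (rng.random(grad.shape) < (levels - lower))
    return np.sign(clipped) * rounded / (num_levels - 1) * scale
```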
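The Gradient $\ell_1$ Regularization entry above penalizes the $\ell_1$ norm of the loss gradient with respect to the weights so that the trained network better tolerates post-training quantization noise. The PyTorch-style sketch below illustrates that general recipe via a double-backward pass; the regularization weight `lam` and the surrounding training loop are assumptions, not the paper's exact setup.

```python
import torch

def l1_grad_regularized_loss(model, loss_fn, x, y, lam=0.05):
    """Task loss plus an l1 penalty on the gradient of the loss w.r.t. the weights.

    Penalizing ||dL/dw||_1 flattens the loss around the learned weights, so that
    perturbations such as post-training quantization noise hurt less.
    `lam` is an illustrative regularization weight (an assumption).
    """
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps the graph so the penalty itself is differentiable.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    l1_penalty = sum(g.abs().sum() for g in grads)
    return loss + lam * l1_penalty

# Illustrative usage inside a standard training step:
# optimizer.zero_grad()
# total = l1_grad_regularized_loss(model, torch.nn.functional.cross_entropy, x, y)
# total.backward()
# optimizer.step()
```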
This list is automatically generated from the titles and abstracts of the papers on this site.