APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD
Algorithm
- URL: http://arxiv.org/abs/2008.11343v2
- Date: Fri, 28 Aug 2020 03:59:08 GMT
- Title: APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD
Algorithm
- Authors: Hanlin Tang, Shaoduo Gan, Samyam Rajbhandari, Xiangru Lian, Ji Liu,
Yuxiong He, Ce Zhang
- Abstract summary: Adam is an important optimization algorithm for training many important tasks, such as BERT and ImageNet, efficiently and accurately.
We propose a communication-efficient ADAM-preconditioned Momentum SGD algorithm, named APMSqueeze, based on an error-compensated method for compressing gradients.
- Score: 39.110478306078974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adam is an important optimization algorithm for training many important
tasks, such as BERT and ImageNet, efficiently and accurately. However, Adam is
generally not compatible with information (gradient) compression technology, so
communication usually becomes the bottleneck when parallelizing Adam. In this
paper, we propose a communication-efficient {\bf A}DAM-{\bf p}reconditioned
{\bf M}omentum SGD algorithm, named APMSqueeze, based on an error-compensated
method for compressing gradients. The proposed algorithm achieves convergence
efficiency similar to Adam in terms of epochs, but significantly reduces the
running time per epoch. In terms of end-to-end performance (including the
full-precision pre-conditioning step), APMSqueeze provides a speed-up of up to
$2$-$10\times$, depending on network bandwidth. We also conduct a theoretical
analysis of the convergence and efficiency.
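The abstract above gives no pseudo-code, so the following is a minimal single-worker sketch of the idea as described: momentum SGD preconditioned by a frozen Adam-style second-moment estimate, with error-compensated 1-bit gradient compression applied to the tensor that would be communicated. The function names, the compression rule, and the exact update order are illustrative assumptions, not the authors' reference implementation.

import numpy as np

def one_bit_compress(x):
    # 1-bit compression: keep only the sign of each entry, scaled so the
    # compressed vector preserves the average magnitude of the input.
    scale = np.mean(np.abs(x))
    return scale * np.sign(x)

def apmsqueeze_step(w, grad, state, lr=1e-3, beta=0.9, eps=1e-8):
    # One single-worker step of an Adam-preconditioned, error-compensated
    # momentum SGD update (a sketch, not the paper's reference code).
    # state['v'] is a frozen Adam-style second-moment estimate from a
    # full-precision warm-up phase; state['e'] is the local compression error.
    corrected = grad + state['e']                # error compensation
    compressed = one_bit_compress(corrected)     # tensor that would be sent
    state['e'] = corrected - compressed          # remember what was lost
    state['m'] = beta * state['m'] + (1 - beta) * compressed    # momentum
    w -= lr * state['m'] / (np.sqrt(state['v']) + eps)          # preconditioned step
    return w

# Toy usage on the quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.ones(4)
state = {'m': np.zeros(4), 'e': np.zeros(4), 'v': np.full(4, 0.1)}
for _ in range(100):
    w = apmsqueeze_step(w, grad=w.copy(), state=state, lr=0.05)
print(w)  # entries should have moved close to zero

In a distributed setting, the compressed tensor is what each worker would communicate; the error term stays local and is folded into the next step.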
Related papers
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O\big(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T}\big)$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Efficient-Adam: Communication-Efficient Distributed Adam [28.287237692902476]
We present a novel communication-efficient distributed Adam with an analysis of its communication complexity.
Two-way quantization is used to reduce the cost between the server and workers.
arXiv Detail & Related papers (2022-05-28T16:17:52Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1
Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - 1-bit Adam: Communication Efficient Large-Scale Training with Adam's
Convergence Speed [39.23129626683372]
Communication has become a major bottleneck on commodity systems with standard TCP interconnects that offer limited network bandwidth.
One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression.
We propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam.
arXiv Detail & Related papers (2021-02-04T21:02:19Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - Adaptive Periodic Averaging: A Practical Approach to Reducing
Communication in Distributed Learning [6.370766463380455]
We show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution.
We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters.
arXiv Detail & Related papers (2020-07-13T00:04:55Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate (a minimal sketch of this rule appears after this list).
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our algorithm requires far fewer communication rounds in theory.
Our experiments on several datasets demonstrate its effectiveness and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
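As referenced in the MaxVA entry above, here is a minimal sketch of one way to realize a "weights chosen to maximize the estimated variance" rule: for each coordinate, a few candidate mixing weights are tried and the one giving the largest variance estimate is kept. The candidate grid and the variance estimator are illustrative assumptions, not the paper's exact rule.

import numpy as np

def maxva_like_second_moment(m, v, grad, betas=(0.5, 0.9, 0.99)):
    # For each coordinate, try a few candidate mixing weights and keep the
    # one that maximizes the estimated variance v - m^2 of that coordinate's
    # gradient. Grid and estimator are illustrative assumptions.
    best_m, best_v = m.copy(), v.copy()
    best_var = np.full_like(grad, -np.inf)
    for beta in betas:
        m_new = beta * m + (1 - beta) * grad
        v_new = beta * v + (1 - beta) * grad ** 2
        var = v_new - m_new ** 2              # per-coordinate variance estimate
        take = var > best_var                 # coordinates where this beta wins
        best_m = np.where(take, m_new, best_m)
        best_v = np.where(take, v_new, best_v)
        best_var = np.maximum(best_var, var)
    return best_m, best_v

# best_v would then replace Adam's running mean of squared gradients in the
# usual step  w -= lr * m / (np.sqrt(best_v) + eps).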