Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
- URL: http://arxiv.org/abs/2202.06009v1
- Date: Sat, 12 Feb 2022 08:02:23 GMT
- Title: Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
- Authors: Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He
- Abstract summary: 1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
- Score: 49.426602335460295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 1-bit communication is an effective method to scale up model training, and
has been studied extensively on SGD. Its benefits, however, remain an open
question for Adam-based model training (e.g., BERT and GPT). In this paper, we
propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two
novel designs: (1) adaptive variance state freezing, which eliminates the
requirement of running expensive full-precision communication at the early
stage of training; and (2) 1-bit sync, which allows skipping communication
rounds through bit-free synchronization of Adam's optimizer states, i.e., the
momentum and the variance.
In theory, we provide a convergence analysis for 0/1 Adam on smooth non-convex
objectives and show that its complexity bound is better than that of the
original Adam under certain conditions. On various benchmarks such as
BERT-Base/Large pretraining and ImageNet, we demonstrate on up to 128 GPUs that
0/1 Adam reduces data volume by up to 90% and communication rounds by up to
54%, and achieves up to 2x higher throughput than the state-of-the-art 1-bit
Adam, while matching its statistical convergence speed and end-to-end model
accuracy on the GLUE benchmark and the ImageNet validation set.
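The interplay of the two designs can be pictured with a short, hedged sketch. The Python snippet below is an illustration only, not the authors' implementation; names such as ZeroOneAdamSketch and freeze_step are assumptions made for the example. It stops updating Adam's variance state after a chosen step, and from then on applies 1-bit sign compression with error feedback to the momentum, which is the quantity that would be exchanged between workers in a distributed run.
```python
import torch

class ZeroOneAdamSketch:
    """Illustrative sketch (not the official 0/1 Adam implementation):
    Adam whose variance state is frozen after `freeze_step`, after which the
    momentum is 1-bit (sign) compressed with error feedback before use."""

    def __init__(self, param, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, freeze_step=100):
        self.param = param
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.freeze_step = freeze_step        # assumed hyperparameter
        self.m = torch.zeros_like(param)      # momentum state
        self.v = torch.zeros_like(param)      # variance state (frozen later)
        self.error = torch.zeros_like(param)  # error-feedback residual
        self.t = 0

    def step(self, grad):
        self.t += 1
        self.m.mul_(self.beta1).add_(grad, alpha=1 - self.beta1)
        if self.t <= self.freeze_step:
            # Early phase: the variance state still evolves, so it would need
            # full-precision communication in a distributed run.
            self.v.mul_(self.beta2).addcmul_(grad, grad, value=1 - self.beta2)
        update = self.m
        if self.t > self.freeze_step:
            # Late phase: variance is frozen; only the momentum is synced,
            # here 1-bit compressed (sign + one scale) with error feedback.
            compensated = update + self.error
            scale = compensated.abs().mean()
            compressed = scale * torch.sign(compensated)
            self.error = compensated - compressed
            update = compressed
        self.param -= self.lr * update / (self.v.sqrt() + self.eps)

# Toy usage: minimize ||x||^2 with the sketch above.
x = torch.randn(10)
start_norm = float(x.norm())
opt = ZeroOneAdamSketch(x, lr=0.05, freeze_step=50)
for _ in range(500):
    opt.step(2 * x)  # gradient of ||x||^2 at the current x
print(start_norm, float(x.norm()))  # the norm shrinks noticeably
```
In a real multi-GPU run the compressed momentum (or the gradient in the early phase) would be averaged across workers before the parameter update; a single process stands in for that exchange here.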
Related papers
- Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization [65.85963235502322]
Federated Adam (FedAdam) algorithms suffer from a threefold increase in uplink communication overhead.
We propose a novel sparse FedAdam algorithm called FedAdam-SSM, wherein distributed devices sparsify the updates of local model parameters and moment estimates.
By minimizing the divergence bound between the model trained by FedAdam-SSM and centralized Adam, we optimize the SSM to mitigate the learning performance degradation caused by sparsification error.
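A minimal sketch of the sparsification idea follows; it is an illustration under assumptions, not the FedAdam-SSM implementation, and the names (shared_topk_mask, sparsify_upload, k) are invented for the example. One top-k mask derived from the model update is applied to the update and to both moment estimates, so the uplink carries a single index set plus three short value lists.
```python
import torch

def shared_topk_mask(update: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean mask selecting the k largest-magnitude entries of `update`."""
    idx = torch.topk(update.abs().flatten(), k).indices
    mask = torch.zeros(update.numel(), dtype=torch.bool)
    mask[idx] = True
    return mask.view_as(update)

def sparsify_upload(delta, momentum, variance, k):
    """Apply one shared mask to the model update and both moment estimates,
    so only k positions per tensor are uploaded instead of the full vectors."""
    mask = shared_topk_mask(delta, k)
    return delta * mask, momentum * mask, variance * mask, mask

# Toy usage: a device uploads a 90%-sparsified payload.
d = 1000
delta, m, v = torch.randn(d), torch.randn(d), torch.rand(d)
s_delta, s_m, s_v, mask = sparsify_upload(delta, m, v, k=d // 10)
print(int(mask.sum()))  # 100 nonzero positions, shared by all three tensors
```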
arXiv Detail & Related papers (2024-05-28T07:56:49Z)
- Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
- A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for solving a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
- How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z)
- 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed [39.23129626683372]
Communication has become a major bottleneck on commodity systems with standard TCP interconnects that offer limited network bandwidth.
One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression.
We propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam.
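The error-compensated compression that this line of work relies on can be sketched generically; the function below is an illustration, not the 1-bit Adam implementation. Whatever the 1-bit quantizer discards in one round is kept as a residual and folded back in before the next round, so the information is only delayed rather than lost.
```python
import torch

def onebit_compress_with_error_feedback(tensor, error):
    """Generic error-compensated 1-bit (sign) compression sketch.

    Returns the tensor that would actually be communicated (one sign per
    element plus one scalar scale) and the updated residual."""
    compensated = tensor + error                  # fold in last round's loss
    scale = compensated.abs().mean()              # single scalar per tensor
    compressed = scale * torch.sign(compensated)  # 1 bit per element + scale
    new_error = compensated - compressed          # what compression discarded
    return compressed, new_error

# Toy usage: compress the same vector repeatedly; the running average of the
# compressed outputs tracks the original because the residual stays bounded.
x = torch.randn(8)
err = torch.zeros_like(x)
avg = torch.zeros_like(x)
for t in range(1, 201):
    c, err = onebit_compress_with_error_feedback(x, err)
    avg += (c - avg) / t
print(float((avg - x).abs().max()))  # gap shrinks on the order of 1/t
```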
arXiv Detail & Related papers (2021-02-04T21:02:19Z)
- Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been tried to make Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition that merely depends on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)
- Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
- APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm [39.110478306078974]
Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training models for tasks such as BERT and ImageNet.
We propose a communication-efficient Adam-preconditioned momentum SGD algorithm, named APMSqueeze, based on an error-compensated method for compressing gradients.
arXiv Detail & Related papers (2020-08-26T02:20:23Z)