CADA: Communication-Adaptive Distributed Adam
- URL: http://arxiv.org/abs/2012.15469v1
- Date: Thu, 31 Dec 2020 06:52:18 GMT
- Title: CADA: Communication-Adaptive Distributed Adam
- Authors: Tianyi Chen, Ziye Guo, Yuejiao Sun and Wotao Yin
- Abstract summary: Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning.
This paper proposes an adaptive gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method.
- Score: 31.02472517086767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) has taken the stage as the primary
workhorse for large-scale machine learning. It is often used with its adaptive
variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive
stochastic gradient descent method for distributed machine learning, which can
be viewed as the communication-adaptive counterpart of the celebrated Adam
method - justifying its name CADA. The key components of CADA are a set of new
rules tailored for adaptive stochastic gradients that can be implemented to
save communication upload. The new algorithms adaptively reuse the stale Adam
gradients, thus saving communication, and still have convergence rates
comparable to original Adam. In numerical experiments, CADA achieves impressive
empirical performance in terms of total communication round reduction.
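As a rough illustration of the reuse-of-stale-gradients idea in the abstract, the sketch below is a toy, single-process simulation of a server with M workers: each worker uploads a fresh stochastic gradient only when it differs enough from the one it last sent, and the server runs an Adam-style update on the (possibly stale) aggregate. The skip rule, threshold, and constants are placeholder assumptions for illustration, not the paper's exact conditions.

```python
import numpy as np

def cada_like_sketch(grad_fns, x0, steps=100, lr=1e-2,
                     beta1=0.9, beta2=0.999, eps=1e-8, thresh=1e-3):
    """Toy simulation of communication-adaptive Adam with a placeholder skip rule.

    grad_fns: list of callables, one per worker; grad_fns[m](x) returns a
              stochastic gradient of worker m's local loss at x.
    """
    x = x0.copy()
    M = len(grad_fns)
    stale = [grad_fns[m](x) for m in range(M)]   # gradient each worker last uploaded
    m_t = np.zeros_like(x)                       # Adam first moment
    v_t = np.zeros_like(x)                       # Adam second moment
    uploads = 0

    for t in range(1, steps + 1):
        for m in range(M):
            g_new = grad_fns[m](x)
            # Placeholder rule: upload only if the local gradient changed enough;
            # otherwise the server keeps reusing the stale gradient (no upload).
            if np.linalg.norm(g_new - stale[m]) > thresh * (np.linalg.norm(g_new) + 1e-12):
                stale[m] = g_new
                uploads += 1

        g = sum(stale) / M                       # aggregate of (possibly stale) gradients
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g * g
        m_hat = m_t / (1 - beta1 ** t)
        v_hat = v_t / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return x, uploads
```

With thresh=0 every worker uploads every round and the loop reduces to vanilla distributed Adam; larger thresholds trade some staleness in the aggregate for fewer uploads.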
Related papers
- Dissecting adaptive methods in GANs [46.90376306847234]
We study how adaptive methods help train generative adversarial networks (GANs).
By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
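For intuition about the update rules being contrasted, here is a minimal sketch for a min-max problem, reading nSGDA as normalized SGDA (an assumption on my part based on the name): plain SGDA steps along the raw stochastic gradients, while the normalized variant keeps only each player's gradient direction. Step sizes and helper names are illustrative.

```python
import numpy as np

def sgda_step(x, y, grad_x, grad_y, lr=1e-2):
    """One step of plain stochastic gradient descent-ascent on min_x max_y f(x, y)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    return x - lr * gx, y + lr * gy

def nsgda_step(x, y, grad_x, grad_y, lr=1e-2, eps=1e-12):
    """One step of normalized SGDA: keep only each player's gradient direction,
    discarding its magnitude (the reading of nSGDA assumed here)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    return (x - lr * gx / (np.linalg.norm(gx) + eps),
            y + lr * gy / (np.linalg.norm(gy) + eps))
```

In the GAN setting, x would play the role of the generator parameters and y the discriminator parameters.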
arXiv Detail & Related papers (2022-10-09T19:00:07Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, converges to the same solution.
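For the convex claim, the following standard fact (recalled here as context, not taken from the paper) explains why all algorithms land in one place: adding an l2 weight-decay term to a convex loss makes the objective strongly convex, so it has a unique minimizer.

```latex
% If f is convex and \lambda > 0, the weight-decay-regularized objective
\[
  F(w) \;=\; f(w) + \tfrac{\lambda}{2}\,\|w\|_2^2
\]
% is \lambda-strongly convex and therefore has a unique minimizer w^*;
% any algorithm that converges to a minimizer of F converges to that same w^*.
```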
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam based variants based on the difference between the present and the past gradients.
We test ensembles of networks and their fusion with a ResNet50 trained with stochastic gradient descent.
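As one concrete example of an update driven by the difference between present and past gradients, the sketch below follows a diffGrad-style rule; this is my guess at the kind of variant being compared and is not necessarily one of the methods tested in the paper.

```python
import numpy as np

def diffgrad_like_step(x, g, g_prev, m, v, t, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-like step damped by the change between current and past gradients
    (a diffGrad-style rule, used here only to illustrate the idea)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Friction coefficient: close to 1 when the gradient changes a lot,
    # close to 0.5 when it barely changes, which slows the update near optima.
    xi = 1.0 / (1.0 + np.exp(-np.abs(g_prev - g)))
    x = x - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```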
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Adam revisited: a weighted past gradients perspective [57.54752290924522]
We propose a novel weighted adaptive algorithm (WADA) to tackle the non-convergence issues of Adam.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
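For context on what a data-dependent regret bound means, the lines below recall the standard online regret and the shape of ADAGRAD's bound (from Duchi et al.); WADA's weighted bound and its constants are in the paper itself.

```latex
% Online regret after T rounds, for loss functions f_t and iterates x_t:
\[
  R(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x).
\]
% ADAGRAD's data-dependent bound scales with the per-coordinate gradient norms,
\[
  R(T) \;=\; O\!\Big( \sum_{i=1}^{d} \big\| g_{1:T,i} \big\|_2 \Big),
\]
% so a bound that weights past gradients can be tighter when gradients shrink over time.
```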
arXiv Detail & Related papers (2021-01-01T14:01:52Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - ClipUp: A Simple and Powerful Optimizer for Distribution-based Policy
Evolution [2.2731500742482305]
We argue that ClipUp is a better choice for distribution-based policy evolution because its working principles are simple and easy to understand.
Experiments show that ClipUp is competitive with Adam despite its simplicity and is effective on challenging continuous control benchmarks.
arXiv Detail & Related papers (2020-08-05T22:46:23Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
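A minimal sketch of the variance-maximizing idea, under my own simplification (the grid search over candidate weights below is a placeholder; the actual method may pick the weight analytically): for each coordinate, choose the squared-gradient averaging weight that maximizes the estimated variance v - m^2.

```python
import numpy as np

def maxva_like_second_moment(m_hat, v_prev, g, betas=(0.5, 0.9, 0.99, 0.999)):
    """Per-coordinate choice of the squared-gradient averaging weight that
    maximizes the estimated variance v - m^2 (illustrative grid search)."""
    best_v, best_var = None, None
    for beta in betas:
        v_cand = beta * v_prev + (1 - beta) * g * g   # candidate second moment
        var_cand = v_cand - m_hat ** 2                # estimated variance per coordinate
        if best_var is None:
            best_v, best_var = v_cand, var_cand
        else:
            take = var_cand > best_var                # keep the coordinate-wise maximum
            best_v = np.where(take, v_cand, best_v)
            best_var = np.where(take, var_cand, best_var)
    return best_v
```

The returned second moment would then replace v_t in the usual Adam denominator.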
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient
Distributed Learning [47.93365664380274]
This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion.
A class of new stochastic gradient descent (SGD) approaches has been developed, which can be viewed as a generalization of the recently developed lazily aggregated gradient (LAG) method.
The key components of LASG are a set of new rules tailored for gradients that can be implemented either to save download, upload, or both.
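To make the flavour of these rules concrete, here is a simplified, illustrative upload rule in the lazily-aggregated-gradient spirit; the threshold form is my simplification, not the exact condition from the paper.

```python
import numpy as np

def should_upload(g_new, g_last_sent, recent_iterate_diffs, c=0.5):
    """Simplified lazily-aggregated-gradient upload rule (illustrative only).

    g_new:                fresh local stochastic gradient
    g_last_sent:          gradient this worker last communicated to the server
    recent_iterate_diffs: list of ||x_{k+1} - x_k||^2 over a few recent rounds
    c:                    tuning constant trading accuracy for communication
    """
    change = np.linalg.norm(g_new - g_last_sent) ** 2
    threshold = c * float(np.mean(recent_iterate_diffs)) if recent_iterate_diffs else 0.0
    # Upload only if the local gradient has drifted more than the threshold;
    # otherwise the server reuses the stale gradient for this worker.
    return change > threshold
```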
arXiv Detail & Related papers (2020-02-26T08:58:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.