CADA: Communication-Adaptive Distributed Adam
- URL: http://arxiv.org/abs/2012.15469v1
- Date: Thu, 31 Dec 2020 06:52:18 GMT
- Title: CADA: Communication-Adaptive Distributed Adam
- Authors: Tianyi Chen, Ziye Guo, Yuejiao Sun and Wotao Yin
- Abstract summary: Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning.
This paper proposes an adaptive gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counterpart of the celebrated Adam method.
- Score: 31.02472517086767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) has taken the stage as the primary
workhorse for large-scale machine learning. It is often used with its adaptive
variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive
stochastic gradient descent method for distributed machine learning, which can
be viewed as the communication-adaptive counterpart of the celebrated Adam
method - justifying its name CADA. The key components of CADA are a set of new
rules tailored for adaptive stochastic gradients that can be implemented to
save communication upload. The new algorithms adaptively reuse the stale Adam
gradients, thus saving communication, and still have convergence rates
comparable to original Adam. In numerical experiments, CADA achieves impressive
empirical performance in terms of total communication round reduction.
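As a rough illustration of the reuse-of-stale-gradients idea in the abstract, the sketch below is a toy, single-process simulation of a server with M workers: each worker uploads a fresh stochastic gradient only when it differs enough from the one it last sent, and the server runs an Adam-style update on the (possibly stale) aggregate. The skip rule, threshold, and constants are placeholder assumptions for illustration, not the paper's exact conditions.

```python
import numpy as np

def cada_like_sketch(grad_fns, x0, steps=100, lr=1e-2,
                     beta1=0.9, beta2=0.999, eps=1e-8, thresh=1e-3):
    """Toy simulation of communication-adaptive Adam with a placeholder skip rule.

    grad_fns: list of callables, one per worker; grad_fns[m](x) returns a
              stochastic gradient of worker m's local loss at x.
    """
    x = x0.copy()
    M = len(grad_fns)
    stale = [grad_fns[m](x) for m in range(M)]   # gradient each worker last uploaded
    m_t = np.zeros_like(x)                       # Adam first moment
    v_t = np.zeros_like(x)                       # Adam second moment
    uploads = 0

    for t in range(1, steps + 1):
        for m in range(M):
            g_new = grad_fns[m](x)
            # Placeholder rule: upload only if the local gradient changed enough;
            # otherwise the server keeps reusing the stale gradient (no upload).
            if np.linalg.norm(g_new - stale[m]) > thresh * (np.linalg.norm(g_new) + 1e-12):
                stale[m] = g_new
                uploads += 1

        g = sum(stale) / M                       # aggregate of (possibly stale) gradients
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g * g
        m_hat = m_t / (1 - beta1 ** t)
        v_hat = v_t / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return x, uploads
```

With thresh=0 every worker uploads every round and the loop reduces to vanilla distributed Adam; larger thresholds trade some staleness in the aggregate for fewer uploads.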
Related papers
- Dissecting adaptive methods in GANs [46.90376306847234]
We study how adaptive methods help train generative adversarial networks (GANs).
By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
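For intuition about the update rules being contrasted, here is a minimal sketch for a min-max problem, reading nSGDA as normalized SGDA (an assumption on my part based on the name): plain SGDA steps along the raw stochastic gradients, while the normalized variant keeps only each player's gradient direction. Step sizes and helper names are illustrative.

```python
import numpy as np

def sgda_step(x, y, grad_x, grad_y, lr=1e-2):
    """One step of plain stochastic gradient descent-ascent on min_x max_y f(x, y)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    return x - lr * gx, y + lr * gy

def nsgda_step(x, y, grad_x, grad_y, lr=1e-2, eps=1e-12):
    """One step of normalized SGDA: keep only each player's gradient direction,
    discarding its magnitude (the reading of nSGDA assumed here)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    return (x - lr * gx / (np.linalg.norm(gx) + eps),
            y + lr * gy / (np.linalg.norm(gy) + eps))
```

In the GAN setting, x would play the role of the generator parameters and y the discriminator parameters.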
arXiv Detail & Related papers (2022-10-09T19:00:07Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, converges to the same solution.
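For the convex claim, the following standard fact (recalled here as context, not taken from the paper) explains why all algorithms land in one place: adding an l2 weight-decay term to a convex loss makes the objective strongly convex, so it has a unique minimizer.

```latex
% If f is convex and \lambda > 0, the weight-decay-regularized objective
\[
  F(w) \;=\; f(w) + \tfrac{\lambda}{2}\,\|w\|_2^2
\]
% is \lambda-strongly convex and therefore has a unique minimizer w^*;
% any algorithm that converges to a minimizer of F converges to that same w^*.
```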
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam based variants based on the difference between the present and the past gradients.
We test ensembles of networks and their fusion with a ResNet50 trained with stochastic gradient descent.
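As one concrete example of an update driven by the difference between present and past gradients, the sketch below follows a diffGrad-style rule; this is my guess at the kind of variant being compared and is not necessarily one of the methods tested in the paper.

```python
import numpy as np

def diffgrad_like_step(x, g, g_prev, m, v, t, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-like step damped by the change between current and past gradients
    (a diffGrad-style rule, used here only to illustrate the idea)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Friction coefficient: close to 1 when the gradient changes a lot,
    # close to 0.5 when it barely changes, which slows the update near optima.
    xi = 1.0 / (1.0 + np.exp(-np.abs(g_prev - g)))
    x = x - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```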
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Adam revisited: a weighted past gradients perspective [57.54752290924522]
We propose a novel weighted adaptive algorithm (WADA) to tackle the non-convergence issues of Adam.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
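For context on what a data-dependent regret bound means, the lines below recall the standard online regret and the shape of ADAGRAD's bound (from Duchi et al.); WADA's weighted bound and its constants are in the paper itself.

```latex
% Online regret after T rounds, for loss functions f_t and iterates x_t:
\[
  R(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x).
\]
% ADAGRAD's data-dependent bound scales with the per-coordinate gradient norms,
\[
  R(T) \;=\; O\!\Big( \sum_{i=1}^{d} \big\| g_{1:T,i} \big\|_2 \Big),
\]
% so a bound that weights past gradients can be tighter when gradients shrink over time.
```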
arXiv Detail & Related papers (2021-01-01T14:01:52Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - ClipUp: A Simple and Powerful Optimizer for Distribution-based Policy
Evolution [2.2731500742482305]
We argue that ClipUp is a better choice for distribution-based policy evolution because its working principles are simple and easy to understand.
Experiments show that ClipUp is competitive with Adam despite its simplicity and is effective on challenging continuous control benchmarks.
arXiv Detail & Related papers (2020-08-05T22:46:23Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
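A minimal sketch of the variance-maximizing idea, under my own simplification (the grid search over candidate weights below is a placeholder; the actual method may pick the weight analytically): for each coordinate, choose the squared-gradient averaging weight that maximizes the estimated variance v - m^2.

```python
import numpy as np

def maxva_like_second_moment(m_hat, v_prev, g, betas=(0.5, 0.9, 0.99, 0.999)):
    """Per-coordinate choice of the squared-gradient averaging weight that
    maximizes the estimated variance v - m^2 (illustrative grid search)."""
    best_v, best_var = None, None
    for beta in betas:
        v_cand = beta * v_prev + (1 - beta) * g * g   # candidate second moment
        var_cand = v_cand - m_hat ** 2                # estimated variance per coordinate
        if best_var is None:
            best_v, best_var = v_cand, var_cand
        else:
            take = var_cand > best_var                # keep the coordinate-wise maximum
            best_v = np.where(take, v_cand, best_v)
            best_var = np.where(take, var_cand, best_var)
    return best_v
```

The returned second moment would then replace v_t in the usual Adam denominator.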
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient
Distributed Learning [47.93365664380274]
This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion.
A class of new stochastic gradient descent (SGD) approaches has been developed, which can be viewed as a generalization of the recently developed lazily aggregated gradient (LAG) method.
The key components of LASG are a set of new rules tailored for gradients that can be implemented either to save download, upload, or both.
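To make the flavour of these rules concrete, here is a simplified, illustrative upload rule in the lazily-aggregated-gradient spirit; the threshold form is my simplification, not the exact condition from the paper.

```python
import numpy as np

def should_upload(g_new, g_last_sent, recent_iterate_diffs, c=0.5):
    """Simplified lazily-aggregated-gradient upload rule (illustrative only).

    g_new:                fresh local stochastic gradient
    g_last_sent:          gradient this worker last communicated to the server
    recent_iterate_diffs: list of ||x_{k+1} - x_k||^2 over a few recent rounds
    c:                    tuning constant trading accuracy for communication
    """
    change = np.linalg.norm(g_new - g_last_sent) ** 2
    threshold = c * float(np.mean(recent_iterate_diffs)) if recent_iterate_diffs else 0.0
    # Upload only if the local gradient has drifted more than the threshold;
    # otherwise the server reuses the stale gradient for this worker.
    return change > threshold
```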
arXiv Detail & Related papers (2020-02-26T08:58:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.