Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration
- URL: http://arxiv.org/abs/2101.05471v1
- Date: Thu, 14 Jan 2021 06:42:29 GMT
- Title: Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration
- Authors: Congliang Chen, Li Shen, Fangyu Zou, Wei Liu
- Abstract summary: Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have tried to make Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition that depends only on the base learning rate and combinations of historical second-order moments.
- Score: 12.744658958445024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam is one of the most influential adaptive stochastic algorithms for
training deep neural networks, yet it has been shown to diverge even in the
simple convex setting via a few simple counterexamples. Many attempts, such as
decreasing the adaptive learning rate, adopting a large batch size,
incorporating a temporal decorrelation technique, seeking an analogous
surrogate, \textit{etc.}, have been made to make Adam-type algorithms
converge. In contrast with existing approaches, we introduce an alternative
easy-to-check sufficient condition, which merely depends on the parameters of
the base learning rate and combinations of historical second-order moments, to
guarantee the global convergence of generic Adam for solving large-scale
non-convex stochastic optimization. This observation, coupled with the
sufficient condition, yields a much deeper interpretation of the divergence of
Adam. On the other hand, mini-batch Adam and distributed Adam are widely used
in practice without theoretical guarantees. We therefore analyze how the batch
size and the number of nodes in a distributed system affect the convergence of
Adam, and show theoretically that mini-batch and distributed Adam can be
linearly accelerated by using a larger mini-batch size or more nodes. Finally,
we apply generic Adam and mini-batch Adam under the sufficient condition to
solve the counterexample and to train several different neural networks on
various real-world datasets. The experimental results are fully consistent
with our theoretical analysis.
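For readers who want to connect the analysis to the algorithm, the following is a minimal sketch of generic Adam with a mini-batch gradient estimator. The hyperparameter names, the $1/\sqrt{t}$ base learning rate schedule, and the batch-averaging loop are illustrative assumptions rather than the paper's exact formulation; the sketch only exposes the two quantities that the sufficient condition constrains, namely the base learning rate and the accumulated second-order moments.

```python
import numpy as np

def mini_batch_adam(grad_fn, x0, data, batch_size=32, base_lr=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000,
                    rng=np.random.default_rng(0)):
    """Minimal sketch of generic Adam with mini-batch gradients.

    grad_fn(x, sample) returns a stochastic gradient at x for one sample.
    The base learning rate schedule and the accumulation of second-order
    moments are the quantities the sufficient condition constrains
    (illustrative choices, not the paper's exact setting).
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first-order moment (momentum)
    v = np.zeros_like(x)   # historical second-order moments
    for t in range(1, num_steps + 1):
        # Mini-batch gradient: averaging over a larger batch reduces the
        # gradient variance, which is the mechanism behind the claimed
        # linear speedup for mini-batch / distributed Adam.
        batch = rng.choice(len(data), size=batch_size, replace=False)
        g = np.mean([grad_fn(x, data[i]) for i in batch], axis=0)

        m = beta1 * m + (1 - beta1) * g          # momentum update
        v = beta2 * v + (1 - beta2) * g ** 2     # second-order moment update

        lr_t = base_lr / np.sqrt(t)              # illustrative base learning rate
        x = x - lr_t * m / (np.sqrt(v) + eps)    # Adam step
    return x
```

In a distributed setting, the batch average would instead be an average of per-node stochastic gradients, which plays the same variance-reduction role behind the claimed linear speedup.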
Related papers
- Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization [65.85963235502322]
Federated Adam (FedAdam) algorithms suffer from a threefold increase in uplink communication overhead.
We propose a novel sparse FedAdam algorithm called FedAdam-SSM, wherein distributed devices sparsify the updates of local model parameters and moment estimates.
By minimizing the divergence bound between the model trained by FedAdam-SSM and centralized Adam, we optimize the SSM to mitigate the learning performance degradation caused by sparsification error.
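As a rough illustration of the kind of sparsification described above, the snippet below applies one shared top-$k$ mask to a local model update and its moment estimates before they would be uploaded. The helper name and the mask-sharing rule are assumptions made for illustration, not the actual SSM construction of FedAdam-SSM.

```python
import numpy as np

def shared_topk_sparsify(delta, m, v, k):
    """Illustrative sketch: sparsify a local update and its moment estimates
    with one shared top-k mask (hypothetical helper, not the paper's SSM)."""
    # Select the k coordinates of the model update with largest magnitude.
    idx = np.argpartition(np.abs(delta), -k)[-k:]
    mask = np.zeros_like(delta, dtype=bool)
    mask[idx] = True
    # Apply the same mask to the update and both moment estimates, so the
    # server only receives values on the shared support.
    return delta * mask, m * mask, v * mask
```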
arXiv Detail & Related papers (2024-05-28T07:56:49Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Efficient-Adam: Communication-Efficient Distributed Adam [28.287237692902476]
We present Efficient-Adam, a novel communication-efficient distributed Adam.
It applies two-way quantization to reduce the communication cost between the server and workers.
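As a hedged sketch of two-way quantization, the toy compressor below is applied on both the uplink (worker to server) and the downlink (server to worker); the sign-and-scale quantizer is chosen only for illustration and is not necessarily the scheme used by Efficient-Adam.

```python
import numpy as np

def quantize(x):
    """Toy sign-and-scale quantizer: 1 bit per coordinate plus one scalar."""
    scale = np.mean(np.abs(x))
    return np.sign(x).astype(np.int8), scale

def dequantize(signs, scale):
    return signs.astype(float) * scale

def round_trip(update):
    """Apply compression in both communication directions (illustrative)."""
    up = dequantize(*quantize(update))    # uplink: worker -> server
    down = dequantize(*quantize(up))      # downlink: server -> worker
    return down
```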
arXiv Detail & Related papers (2022-05-28T16:17:52Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for solving a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization [43.504548777955854]
We show that Adam-type algorithms converge to one-sided first-order stationary points in min-max optimization problems under the one-sided MVI condition.
We also empirically verify that this one-sided MVI condition is satisfied for standard GANs after training on standard datasets.
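For context, a standard statement of the Minty variational inequality (MVI) condition is given below; reading MVI this way, and describing the one-sided variant as restricting the inequality to one block of variables, are assumptions based on common usage rather than quotations from the paper.

```latex
% Minty variational inequality (MVI) for the operator of a min-max problem
% \min_x \max_y f(x, y), with F(z) = (\nabla_x f(x, y), -\nabla_y f(x, y)):
\exists\, z^\ast \quad \text{such that} \quad
\langle F(z),\, z - z^\ast \rangle \;\ge\; 0 \qquad \text{for all } z.
% A one-sided variant would impose this inequality with respect to only one
% block of variables (illustrative interpretation).
```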
arXiv Detail & Related papers (2021-09-29T06:38:39Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - Adam with Bandit Sampling for Deep Learning [18.033149110113378]
We propose a generalization of Adam, called Adambs, that allows us to adapt to different training examples.
Experiments on various models and datasets demonstrate Adambs's fast convergence in practice.
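To make "adapting to different training examples" concrete, here is a minimal sketch in which an EXP3-style bandit keeps a sampling weight per example and feeds mini-batches to the optimizer; the particular bandit rule and reward definition are illustrative assumptions, not necessarily those used by Adambs.

```python
import numpy as np

class ExampleBandit:
    """Illustrative EXP3-style sampler over training examples (assumption:
    not the exact bandit used by Adambs)."""
    def __init__(self, n_examples, eta=0.01, rng=np.random.default_rng(0)):
        self.w = np.zeros(n_examples)   # log-weights over examples
        self.eta = eta
        self.rng = rng

    def sample(self, batch_size):
        # Softmax over log-weights gives the sampling distribution.
        p = np.exp(self.w - self.w.max())
        p /= p.sum()
        idx = self.rng.choice(len(p), size=batch_size, replace=False, p=p)
        return idx, p[idx]

    def update(self, idx, rewards, probs):
        # Importance-weighted reward update: examples that were "useful"
        # (e.g., had high loss) get sampled more often next time.
        self.w[idx] += self.eta * rewards / np.maximum(probs, 1e-12)
```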
arXiv Detail & Related papers (2020-10-24T21:01:26Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
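For background, the textbook consensus ADMM iteration over $N$ edge nodes with local objectives $f_i$ is sketched below; this is the generic scheme rather than the coded, mini-batch variant developed in the paper, whose stochastic version would replace the exact local minimization with mini-batch updates.

```latex
% Textbook consensus ADMM over N nodes (not the paper's coded mini-batch variant):
x_i^{k+1} = \arg\min_{x}\Big\{ f_i(x) + \tfrac{\rho}{2}\,\lVert x - z^{k} + u_i^{k}\rVert^{2} \Big\},
\qquad i = 1,\dots,N,\\[4pt]
z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^{k}\bigr),
\qquad
u_i^{k+1} = u_i^{k} + x_i^{k+1} - z^{k+1}.
```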
arXiv Detail & Related papers (2020-10-02T10:41:59Z)