Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models
- URL: http://arxiv.org/abs/2208.06677v1
- Date: Sat, 13 Aug 2022 16:04:39 GMT
- Title: Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models
- Authors: Xingyu Xie and Pan Zhou and Huan Li and Zhouchen Lin and Shuicheng Yan
- Abstract summary: Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy ball acceleration in theory and also in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
We show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks.
- Score: 158.19276683455254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient algorithms borrow the moving average idea of heavy ball
acceleration to estimate accurate first- and second-order moments of gradient
for accelerating convergence. However, Nesterov acceleration, which converges
faster than heavy ball acceleration in theory and also in many empirical cases,
is much less investigated under the adaptive gradient setting. In this work, we
propose the ADAptive Nesterov momentum algorithm, Adan for short, to
effectively speed up the training of deep neural networks. Adan first
reformulates the vanilla Nesterov acceleration to develop a new Nesterov
momentum estimation (NME) method, which avoids the extra computation and memory
overhead of computing gradient at the extrapolation point. Then Adan adopts NME
to estimate the first- and second-order moments of the gradient in adaptive
gradient algorithms for convergence acceleration. Besides, we prove that Adan
finds an $\epsilon$-approximate first-order stationary point within
$O(\epsilon^{-3.5})$ stochastic gradient complexity on the nonconvex stochastic
problems (e.g., deep learning problems), matching the best-known lower bound.
Extensive experimental results show that Adan surpasses the corresponding SoTA
optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for
many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM,
Transformer-XL, and BERT. More surprisingly, Adan can use half of the training
cost (epochs) of SoTA optimizers to achieve higher or comparable performance on
ViT and ResNet, etc., and also shows great tolerance to a large range of
minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the
development of deep learning by reducing training cost and relieving the
engineering burden of trying different optimizers on various architectures.
Code will be released at https://github.com/sail-sg/Adan.
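To make the update rule concrete, here is a minimal NumPy sketch of an Adan-style step following the recurrences described above: the NME term tracks the gradient difference g_k - g_{k-1} rather than re-evaluating the gradient at an extrapolation point, and the corrected gradient feeds both the first- and second-order moment estimates. The function name, hyperparameter values, the omission of bias correction, and the decoupled weight-decay form are assumptions for illustration only; the authors' implementation lives in the repository linked above.

import numpy as np

def adan_step(theta, grad, prev_grad, state, lr=1e-3,
              beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, weight_decay=0.0):
    """One Adan-style update (illustrative sketch; bias correction omitted).

    The Nesterov momentum estimation (NME) uses the gradient difference
    grad - prev_grad, so no extra forward/backward pass at an extrapolation
    point is needed.
    """
    m, v, n = state["m"], state["v"], state["n"]
    diff = grad - prev_grad                     # gradient difference used by NME
    m = (1 - beta1) * m + beta1 * grad          # first-order moment (EMA of gradients)
    v = (1 - beta2) * v + beta2 * diff          # EMA of gradient differences
    u = grad + (1 - beta2) * diff               # Nesterov-corrected gradient estimate
    n = (1 - beta3) * n + beta3 * u * u         # second-order moment of the corrected gradient
    eta = lr / (np.sqrt(n) + eps)               # per-coordinate step size
    theta = (theta - eta * (m + (1 - beta2) * v)) / (1.0 + lr * weight_decay)
    state.update(m=m, v=v, n=n)
    return theta, state

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
theta = np.ones(4)
state = {k: np.zeros(4) for k in ("m", "v", "n")}
prev_grad = None
for _ in range(200):
    grad = theta.copy()
    pg = grad if prev_grad is None else prev_grad   # diff = 0 at the first step
    theta, state = adan_step(theta, grad, pg, state, lr=0.05)
    prev_grad = grad
print(theta)  # parameters move toward the minimizer at 0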
Related papers
- Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning [2.695991050833627]
We propose a new optimization algorithm named CG-like-Adam for deep learning.
Specifically, both the first-order and the second-order moment estimates of generic Adam are replaced by their conjugate-gradient-like counterparts.
Numerical experiments on the CIFAR10/100 datasets show the superiority of the proposed algorithm.
arXiv Detail & Related papers (2024-04-02T07:57:17Z)
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that requires matrix inversion only during the first epoch.
FNGD's update exhibits similarities to the average sum in first-order methods, making its computational complexity comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale problems with a deep neural network.
In theory, our algorithm requires a much smaller number of communication rounds.
Experiments on several datasets demonstrate the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD at every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a higher reduction in computation load at the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Gradient descent with momentum --- to accelerate or to super-accelerate? [0.0]
We show that the algorithm can be improved by extending this acceleration.
Super-acceleration is also easy to incorporate into adaptive algorithms like RMSProp or Adam.
arXiv Detail & Related papers (2020-01-17T18:50:07Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)