Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models
- URL: http://arxiv.org/abs/2208.06677v1
- Date: Sat, 13 Aug 2022 16:04:39 GMT
- Title: Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models
- Authors: Xingyu Xie and Pan Zhou and Huan Li and Zhouchen Lin and Shuicheng Yan
- Abstract summary: Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy ball acceleration in theory and also in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
We show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks.
- Score: 158.19276683455254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient algorithms borrow the moving average idea of heavy ball
acceleration to estimate accurate first- and second-order moments of gradient
for accelerating convergence. However, Nesterov acceleration which converges
faster than heavy ball acceleration in theory and also in many empirical cases
is much less investigated under the adaptive gradient setting. In this work, we
propose the ADAptive Nesterov momentum algorithm, Adan for short, to
effectively speedup the training of deep neural networks. Adan first
reformulates the vanilla Nesterov acceleration to develop a new Nesterov
momentum estimation (NME) method, which avoids the extra computation and memory
overhead of computing gradient at the extrapolation point. Then Adan adopts NME
to estimate the first- and second-order moments of the gradient in adaptive
gradient algorithms for convergence acceleration. Besides, we prove that Adan
finds an $\epsilon$-approximate first-order stationary point within
$O(\epsilon^{-3.5})$ stochastic gradient complexity on the nonconvex stochastic
problems (e.g., deep learning problems), matching the best-known lower bound.
Extensive experimental results show that Adan surpasses the corresponding SoTA
optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for
many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM,
Transformer-XL, and BERT. More surprisingly, Adan can use half of the training
cost (epochs) of SoTA optimizers to achieve higher or comparable performance on
ViT and ResNet, etc., and also shows great tolerance to a large range of
minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the
development of deep learning by reducing training cost and relieving the
engineering burden of trying different optimizers on various architectures.
Code will be released at https://github.com/sail-sg/Adan.
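As a rough illustration of the idea in the abstract: one way to avoid evaluating the gradient at the extrapolation point is to approximate it with the current gradient plus a scaled gradient difference, and then feed that estimate into adaptive first- and second-order moments. The NumPy sketch below is a hedged, illustrative version of such an update, not the official implementation; the function name, hyperparameter defaults, weight-decay form, and omitted bias corrections are all assumptions, and the exact algorithm is given in the paper and the repository linked above.

    import numpy as np

    def adan_like_step(theta, grad, prev_grad, state, lr=1e-3,
                       beta1=0.02, beta2=0.08, beta3=0.01,
                       eps=1e-8, weight_decay=0.0):
        """One parameter update in the spirit of Nesterov momentum estimation.

        Illustrative sketch only: names, defaults, and the decoupled
        weight-decay form are assumptions, not the paper's exact recipe.
        """
        diff = grad - prev_grad  # gradient difference g_k - g_{k-1}
        # First-order moments of the gradient and of the gradient difference.
        state["m"] = (1 - beta1) * state["m"] + beta1 * grad
        state["v"] = (1 - beta2) * state["v"] + beta2 * diff
        # Corrected gradient: current gradient plus a scaled difference,
        # standing in for the gradient at the extrapolation point.
        g_corr = grad + (1 - beta2) * diff
        # Second-order moment built from the corrected gradient.
        state["n"] = (1 - beta3) * state["n"] + beta3 * g_corr ** 2
        step = lr * (state["m"] + (1 - beta2) * state["v"]) / (np.sqrt(state["n"]) + eps)
        # Decoupled (AdamW-style) weight decay, applied multiplicatively.
        return (theta - step) / (1 + lr * weight_decay), state

    # Toy usage: minimize f(x) = ||x||^2 with its exact gradient.
    theta = np.ones(4)
    state = {"m": np.zeros(4), "v": np.zeros(4), "n": np.zeros(4)}
    prev_grad = np.zeros(4)
    for _ in range(500):
        grad = 2.0 * theta
        theta, state = adan_like_step(theta, grad, prev_grad, state, lr=0.01)
        prev_grad = grad
    print(theta)  # values should have shrunk toward zero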
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient optimization.
Results indicate that MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- AdaFisher: Adaptive Second Order Optimization via Fisher Information [22.851200800265914]
We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation of the Fisher information matrix for adaptive gradient preconditioning.
We demonstrate that AdaFisher outperforms SOTA optimizers in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2024-05-26T01:25:02Z)
- Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning [2.695991050833627]
We propose a new optimization algorithm named CG-like-Adam for deep learning.
Specifically, both the first-order and the second-order moment estimation of generic Adam are replaced by conjugate-gradient-like directions (a generic sketch of this kind of direction appears after this list).
Numerical experiments on the CIFAR-10/100 datasets show the superiority of the proposed algorithm.
arXiv Detail & Related papers (2024-04-02T07:57:17Z)
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that requires matrix inversion only during the first epoch.
FNGD resembles the weighted average sum used in first-order methods, which makes its computational complexity comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z)
- Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW, when combined with the proposed per-layer step-sizes, are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point per step.
Our convergence results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first work to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
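To make the conjugate-gradient-like direction mentioned in the CG-like-Adam entry above more concrete, here is a hedged, generic sketch of a Fletcher-Reeves-style direction that mixes the current negative gradient with the previous search direction. The names, the fixed step size, and the way the direction is applied are illustrative assumptions and are not taken from that paper, which plugs such directions into Adam's moment estimates rather than using them directly.

    import numpy as np

    def conjugate_direction(grad, prev_grad, prev_dir, eps=1e-12):
        # Fletcher-Reeves-style coefficient: ratio of squared gradient norms.
        beta_fr = np.dot(grad, grad) / (np.dot(prev_grad, prev_grad) + eps)
        # Mix the steepest-descent direction with the previous search direction.
        return -grad + beta_fr * prev_dir

    # Toy usage on the quadratic f(x) = 0.5 * x^T A x.
    A = np.diag([1.0, 10.0])
    x = np.array([5.0, 5.0])
    prev_grad = A @ x
    direction = -prev_grad
    for _ in range(200):
        grad = A @ x
        direction = conjugate_direction(grad, prev_grad, direction)
        x = x + 0.05 * direction  # small fixed step size for simplicity
        prev_grad = grad
    print(x)  # moves toward the minimizer at the origin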