ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
- URL: http://arxiv.org/abs/2006.00719v3
- Date: Thu, 29 Apr 2021 00:52:09 GMT
- Title: ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
- Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer,
Michael W. Mahoney
- Abstract summary: We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
- Score: 91.13797346047984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm
which dynamically incorporates the curvature of the loss function via ADAptive
estimates of the HESSIAN. Second order algorithms are among the most powerful
optimization algorithms with superior convergence properties as compared to
first order methods such as SGD and Adam. The main disadvantage of traditional
second order methods is their heavier per iteration computation and poor
accuracy as compared to first order methods. To address these, we incorporate
several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based
method to approximate the curvature matrix with low computational overhead;
(ii) a root-mean-square exponential moving average to smooth out variations of
the Hessian diagonal across different iterations; and (iii) a block diagonal
averaging to reduce the variance of Hessian diagonal elements. We show that
ADAHESSIAN achieves new state-of-the-art results by a large margin as compared
to other adaptive optimization methods, including variants of Adam. In
particular, we perform extensive tests on CV, NLP, and recommendation system
tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on
ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to
Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on
IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for
SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than
Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the
cost per iteration of ADAHESSIAN is comparable to first order methods, and that
it exhibits robustness towards its hyperparameters.
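The three ingredients above translate into a compact Adam-style update in which the squared-gradient second moment is replaced by a smoothed estimate of the Hessian diagonal. Below is a minimal PyTorch sketch of that recipe, written from the abstract alone rather than from the authors' released code; the function names (hutchinson_diag_estimate, block_average, adahessian_like_step), the single-probe estimator, the contiguous block layout, and the default hyperparameters are illustrative assumptions.

```python
import torch

def hutchinson_diag_estimate(loss, params):
    """One-probe Hutchinson estimate of the Hessian diagonal, component (i):
    E[z * (Hz)] = diag(H) when z has i.i.d. Rademacher (+/-1) entries."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # Rademacher probes
    hvps = torch.autograd.grad(grads, params, grad_outputs=zs)        # Hessian-vector products
    diag_ests = [z * hvp for z, hvp in zip(zs, hvps)]
    return diag_ests, [g.detach() for g in grads]

def block_average(d, block_size):
    """Average the |diagonal| estimate over contiguous blocks, component (iii)."""
    flat = d.abs().flatten()
    out = torch.empty_like(flat)
    for start in range(0, flat.numel(), block_size):
        end = min(start + block_size, flat.numel())
        out[start:end] = flat[start:end].mean()
    return out.view_as(d)

def adahessian_like_step(params, loss, exp_avg, exp_hess_sq, step,
                         lr=0.1, betas=(0.9, 0.999), eps=1e-4,
                         weight_decay=0.0, block_size=32):
    """One Adam-like step preconditioned by the smoothed Hessian diagonal
    (illustrative defaults, not tuned values from the paper)."""
    beta1, beta2 = betas
    diag_ests, grads = hutchinson_diag_estimate(loss, params)
    with torch.no_grad():
        for p, g, d, m, v in zip(params, grads, diag_ests, exp_avg, exp_hess_sq):
            d = block_average(d, block_size)
            m.mul_(beta1).add_(g, alpha=1 - beta1)         # first moment of the gradient
            v.mul_(beta2).addcmul_(d, d, value=1 - beta2)  # component (ii): RMS-style EMA of diag(H)
            m_hat = m / (1 - beta1 ** step)                # bias corrections as in Adam
            v_hat = v / (1 - beta2 ** step)
            p.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * p, alpha=-lr)
```

A hypothetical training loop would keep `exp_avg` and `exp_hess_sq` as zero-initialized buffers (one per parameter) and call `adahessian_like_step(params, loss, exp_avg, exp_hess_sq, step)` in place of `loss.backward()` plus an optimizer step. Relative to Adam, the only extra work is the additional backward pass for the Hessian-vector product, consistent with the abstract's claim that the per-iteration cost stays close to first-order methods.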
Related papers
- Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices [24.319712013824876]
Adapprox is a novel approach that employs randomized low-rank matrix approximation for a more accurate approximation of Adam's second moment.
In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% memory savings.
arXiv Detail & Related papers (2024-03-22T05:23:31Z) - Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens [34.72514951778262]
We study AdamQLR: an optimiser combining damping and learning rate selection techniques from K-FAC.
We evaluate AdamQLR on a range of regression and classification tasks at various scales.
We find that an untuned AdamQLR setting can achieve performance-versus-runtime comparable to tuned benchmarks.
arXiv Detail & Related papers (2023-10-23T14:06:46Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Efficient first-order predictor-corrector multiple objective
optimization for fair misinformation detection [5.139559672771439]
Multiple-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning.
We propose a Gauss-Newton approximation that scales only linearly and requires only first-order inner-product computations per iteration.
These innovations make predictor-corrector methods feasible for large networks.
arXiv Detail & Related papers (2022-09-15T12:32:15Z) - ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the gradient approximation that DARTS relies on.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z) - BiAdam: Fast Adaptive Bilevel Optimization Methods [104.96004056928474]
Bilevel optimization has attracted increased interest in machine learning due to its many applications.
We provide a useful analysis framework for both constrained and unconstrained bilevel optimization.
arXiv Detail & Related papers (2021-06-21T20:16:40Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for
Nonconvex Stochastic Optimization [17.219297142656828]
We introduce a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss by approximating the Hessian with a diagonal matrix.
The implementation of the algorithm is available at https://github.com/XuezheMax/apollo.
arXiv Detail & Related papers (2020-09-28T19:07:02Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Iterative Averaging in the Quest for Best Test Error [22.987387623516614]
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model.
We derive three phenomena from our theoretical results.
We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets.
arXiv Detail & Related papers (2020-03-02T23:27:29Z)