ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
- URL: http://arxiv.org/abs/2006.00719v3
- Date: Thu, 29 Apr 2021 00:52:09 GMT
- Title: ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
- Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer,
Michael W. Mahoney
- Abstract summary: We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
- Score: 91.13797346047984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ADAHESSIAN, a second order stochastic optimization algorithm
which dynamically incorporates the curvature of the loss function via ADAptive
estimates of the HESSIAN. Second order algorithms are among the most powerful
optimization algorithms with superior convergence properties as compared to
first order methods such as SGD and Adam. The main disadvantage of traditional
second order methods is their heavier per iteration computation and poor
accuracy as compared to first order methods. To address these, we incorporate
several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based
method to approximate the curvature matrix with low computational overhead;
(ii) a root-mean-square exponential moving average to smooth out variations of
the Hessian diagonal across different iterations; and (iii) a block diagonal
averaging to reduce the variance of Hessian diagonal elements. We show that
ADAHESSIAN achieves new state-of-the-art results by a large margin as compared
to other adaptive optimization methods, including variants of Adam. In
particular, we perform extensive tests on CV, NLP, and recommendation system
tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on
ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to
Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on
IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for
SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than
Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the
cost per iteration of ADAHESSIAN is comparable to first order methods, and that
it exhibits robustness towards its hyperparameters.
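The three ingredients above translate into a compact Adam-style update in which the squared-gradient second moment is replaced by a smoothed estimate of the Hessian diagonal. Below is a minimal PyTorch sketch of that recipe, written from the abstract alone rather than from the authors' released code; the function names (hutchinson_diag_estimate, block_average, adahessian_like_step), the single-probe estimator, the contiguous block layout, and the default hyperparameters are illustrative assumptions.

```python
import torch

def hutchinson_diag_estimate(loss, params):
    """One-probe Hutchinson estimate of the Hessian diagonal, component (i):
    E[z * (Hz)] = diag(H) when z has i.i.d. Rademacher (+/-1) entries."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # Rademacher probes
    hvps = torch.autograd.grad(grads, params, grad_outputs=zs)        # Hessian-vector products
    diag_ests = [z * hvp for z, hvp in zip(zs, hvps)]
    return diag_ests, [g.detach() for g in grads]

def block_average(d, block_size):
    """Average the |diagonal| estimate over contiguous blocks, component (iii)."""
    flat = d.abs().flatten()
    out = torch.empty_like(flat)
    for start in range(0, flat.numel(), block_size):
        end = min(start + block_size, flat.numel())
        out[start:end] = flat[start:end].mean()
    return out.view_as(d)

def adahessian_like_step(params, loss, exp_avg, exp_hess_sq, step,
                         lr=0.1, betas=(0.9, 0.999), eps=1e-4,
                         weight_decay=0.0, block_size=32):
    """One Adam-like step preconditioned by the smoothed Hessian diagonal
    (illustrative defaults, not tuned values from the paper)."""
    beta1, beta2 = betas
    diag_ests, grads = hutchinson_diag_estimate(loss, params)
    with torch.no_grad():
        for p, g, d, m, v in zip(params, grads, diag_ests, exp_avg, exp_hess_sq):
            d = block_average(d, block_size)
            m.mul_(beta1).add_(g, alpha=1 - beta1)         # first moment of the gradient
            v.mul_(beta2).addcmul_(d, d, value=1 - beta2)  # component (ii): RMS-style EMA of diag(H)
            m_hat = m / (1 - beta1 ** step)                # bias corrections as in Adam
            v_hat = v / (1 - beta2 ** step)
            p.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * p, alpha=-lr)
```

A hypothetical training loop would keep `exp_avg` and `exp_hess_sq` as zero-initialized buffers (one per parameter) and call `adahessian_like_step(params, loss, exp_avg, exp_hess_sq, step)` in place of `loss.backward()` plus an optimizer step. Relative to Adam, the only extra work is the additional backward pass for the Hessian-vector product, consistent with the abstract's claim that the per-iteration cost stays close to first-order methods.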
Related papers
- Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices [24.319712013824876]
Adapprox is a novel approach that employs randomized low-rank matrix approximation for a more accurate approximation of Adam's second moment.
In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% memory savings.
arXiv Detail & Related papers (2024-03-22T05:23:31Z) - Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens [34.72514951778262]
We study AdamQLR: an optimiser combining damping and learning rate selection techniques from K-FAC.
We evaluate AdamQLR on a range of regression and classification tasks at various scales.
We find that an untuned AdamQLR setting can achieve performance-versus-runtime comparable to tuned benchmarks.
arXiv Detail & Related papers (2023-10-23T14:06:46Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Efficient first-order predictor-corrector multiple objective
optimization for fair misinformation detection [5.139559672771439]
Multiple-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning.
We propose a Gauss-Newton approximation that scales only linearly and requires only first-order inner-product computations per iteration.
These innovations make predictor-corrector methods feasible for large networks.
arXiv Detail & Related papers (2022-09-15T12:32:15Z) - ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the gradient approximation that DARTS relies on.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z) - BiAdam: Fast Adaptive Bilevel Optimization Methods [104.96004056928474]
Bilevel optimization has attracted increased interest in machine learning due to its many applications.
We provide a useful analysis framework for both constrained and unconstrained bilevel optimization.
arXiv Detail & Related papers (2021-06-21T20:16:40Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for
Nonconvex Stochastic Optimization [17.219297142656828]
We introduce a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss by approximating the Hessian with a diagonal matrix.
The implementation of the algorithm is available at https://github.com/XuezheMax/apollo.
arXiv Detail & Related papers (2020-09-28T19:07:02Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Iterative Averaging in the Quest for Best Test Error [22.987387623516614]
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model.
We derive three phenomena from our theoretical results.
We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets.
arXiv Detail & Related papers (2020-03-02T23:27:29Z)