Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for
Nonconvex Stochastic Optimization
- URL: http://arxiv.org/abs/2009.13586v6
- Date: Fri, 20 Aug 2021 05:31:09 GMT
- Title: Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for
Nonconvex Stochastic Optimization
- Authors: Xuezhe Ma
- Abstract summary: We introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal matrix.
The implementation of the algorithm is available at https://github.com/XuezheMax/apollo.
- Score: 17.219297142656828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce Apollo, a quasi-Newton method for nonconvex
stochastic optimization, which dynamically incorporates the curvature of the
loss function by approximating the Hessian via a diagonal matrix. Importantly,
the update and storage of the diagonal approximation of the Hessian are as efficient
as adaptive first-order optimization methods with linear complexity for both
time and memory. To handle nonconvexity, we replace the Hessian with its
rectified absolute value, which is guaranteed to be positive-definite.
Experiments on three tasks of vision and language show that Apollo achieves
significant improvements over other stochastic optimization methods, including
SGD and variants of Adam, in terms of both convergence speed and generalization
performance. The implementation of the algorithm is available at
https://github.com/XuezheMax/apollo.
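To make the update concrete, below is a minimal NumPy sketch of a diagonal quasi-Newton step in the spirit of Apollo. It is an illustration under simplifying assumptions, not the paper's exact algorithm: the elementwise curvature estimate here is a naive secant-style ratio, whereas Apollo derives its diagonal update from a parameter-wise weak secant condition; all names (apollo_like_step, sigma, beta) are illustrative.

import numpy as np

def init_state(theta):
    # All buffers are elementwise, so time and memory stay linear in the
    # number of parameters, as claimed in the abstract.
    return {
        "m": np.zeros_like(theta),        # EMA of gradients
        "B": np.zeros_like(theta),        # diagonal curvature estimate
        "prev_theta": theta.copy(),
        "prev_grad": np.zeros_like(theta),
    }

def apollo_like_step(theta, grad, state, lr=1e-2, beta=0.9, sigma=1e-2, eps=1e-4):
    # 1) Exponential moving average of the gradient, as in adaptive
    #    first-order methods.
    state["m"] = beta * state["m"] + (1.0 - beta) * grad

    # 2) Crude elementwise curvature estimate from parameter and gradient
    #    differences (NOT the paper's weak-secant update; illustration only).
    s = theta - state["prev_theta"]
    y = grad - state["prev_grad"]
    safe_s = np.where(np.abs(s) > eps, s, eps)   # avoid division by ~0
    state["B"] = beta * state["B"] + (1.0 - beta) * (y / safe_s)

    # 3) Rectified absolute value of the diagonal estimate: max(|B|, sigma) is
    #    strictly positive, so the preconditioner is positive-definite even
    #    when the loss is nonconvex.
    D = np.maximum(np.abs(state["B"]), sigma)

    # 4) Quasi-Newton style step: precondition the gradient EMA by D^{-1}.
    state["prev_theta"], state["prev_grad"] = theta.copy(), grad.copy()
    return theta - lr * state["m"] / D, state

The point carried over from the abstract is step 3: replacing a possibly indefinite curvature estimate by its rectified absolute value keeps the preconditioner positive-definite, while every buffer and operation remains elementwise, matching the linear time and memory cost of adaptive first-order methods.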
Related papers
- Information-Theoretic Trust Regions for Stochastic Gradient-Based
Optimization [17.79206971486723]
We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the generalization capabilities of SGD.
We approximate the diagonal elements of the Hessian from gradients, constructing a model of the expected Hessian over time using only first-order information.
arXiv Detail & Related papers (2023-10-31T16:08:38Z) - Series of Hessian-Vector Products for Tractable Saddle-Free Newton
Optimisation of Neural Networks [1.3654846342364308]
We show a first scalable optimisation algorithm that can efficiently use the exact inverse Hessian with absolute-value eigenvalues.
Truncating this series after t terms provides a new optimisation algorithm which is comparable to other first- and second-order optimisation methods.
arXiv Detail & Related papers (2023-10-23T13:11:30Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter-free) gradient-based adaption method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Faster Optimization on Sparse Graphs via Neural Reparametrization [15.275428333269453]
We show that a graph neural network can implement an efficient Quasi-Newton method that can speed up optimization by a factor of 10-100x.
We show the application of our method on scientifically relevant problems including heat diffusion, synchronization and persistent homology.
arXiv Detail & Related papers (2022-05-26T20:52:18Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms to solve nonconvex optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - BiAdam: Fast Adaptive Bilevel Optimization Methods [104.96004056928474]
Bilevel optimization has attracted increased interest in machine learning due to its many applications.
We provide a useful analysis framework for both constrained and unconstrained optimization.
arXiv Detail & Related papers (2021-06-21T20:16:40Z) - Implicit differentiation for fast hyperparameter selection in non-smooth
convex learning [87.60600646105696]
We study first-order methods when the inner optimization problem is convex but non-smooth.
We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian.
arXiv Detail & Related papers (2021-05-04T17:31:28Z) - Slowly Varying Regression under Sparsity [5.22980614912553]
We present the framework of slowly varying regression under sparsity, allowing regression models to exhibit slow and sparse variations.
We suggest a procedure that reformulates the problem as a binary convex optimization problem.
We show that the resulting model outperforms competing formulations in comparable times across various datasets.
arXiv Detail & Related papers (2021-02-22T04:51:44Z) - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second-order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN (a diagonal estimate of this kind is sketched below).
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
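For context on the ADAHESSIAN entry above, here is a hedged PyTorch sketch of a Hutchinson-style estimator for the Hessian diagonal, the kind of adaptive curvature estimate that optimizer builds on. It omits the spatial averaging and momentum ADAHESSIAN applies on top, and `loss`/`params` are assumed to come from an ordinary differentiable model.

import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    # Hutchinson's identity: E[z * (H z)] = diag(H) for Rademacher probes z.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probe vectors with entries +1 or -1.
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector products H z via a second backward pass.
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hz in zip(diag_est, zs, hvps):
            d.add_(z * hz / n_samples)   # average over probes
    return diag_est

Such a diagonal estimate can then serve as an elementwise preconditioner of the gradient (or of its moving average), much like the rectified diagonal in the Apollo sketch above.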
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.