SGD with Partial Hessian for Deep Neural Networks Optimization
- URL: http://arxiv.org/abs/2403.02681v1
- Date: Tue, 5 Mar 2024 06:10:21 GMT
- Title: SGD with Partial Hessian for Deep Neural Networks Optimization
- Authors: Ying Sun, Hongwei Yong, Lei Zhang
- Abstract summary: We propose a compound optimizer, which combines a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters.
Compared with first-order optimizers, it adopts a certain amount of information from the Hessian matrix to assist optimization, while compared with the existing second-order optimizers, it keeps the good generalization performance of first-order optimizers.
- Score: 18.78728272603732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the effectiveness of second-order algorithms in solving classical
optimization problems, designing second-order optimizers to train deep neural
networks (DNNs) has attracted much research interest in recent years. However,
because of the very high dimension of intermediate features in DNNs, it is
difficult to directly compute and store the Hessian matrix for network
optimization. Most of the previous second-order methods approximate the Hessian
information imprecisely, resulting in unstable performance. In this work, we
propose a compound optimizer, which is a combination of a second-order
optimizer with a precise partial Hessian matrix for updating channel-wise
parameters and the first-order stochastic gradient descent (SGD) optimizer for
updating the other parameters. We show that the associated Hessian matrices of
channel-wise parameters are diagonal and can be extracted directly and
precisely from Hessian-free methods. The proposed method, namely SGD with
Partial Hessian (SGD-PH), inherits the advantages of both first-order and
second-order optimizers. Compared with first-order optimizers, it adopts a
certain amount of information from the Hessian matrix to assist optimization,
while compared with the existing second-order optimizers, it keeps the good
generalization performance of first-order optimizers. Experiments on image
classification tasks demonstrate the effectiveness of our proposed optimizer
SGD-PH. The code is publicly available at
\url{https://github.com/myingysun/SGDPH}.
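The compound update described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation (the repository linked above contains that): it treats every one-dimensional parameter (e.g. normalization scale/shift and bias vectors) as "channel-wise", which simplifies the paper's definition, extracts the Hessian diagonal of those parameters with a single Hessian-vector product against the all-ones vector (exact when the corresponding Hessian block is diagonal, as the abstract argues), and applies plain SGD with momentum to everything else. The helper names `is_channelwise` and `sgd_ph_step`, the damping term, and the hyper-parameter values are illustrative choices.

```python
# Minimal sketch of an SGD-PH-style compound step in PyTorch (illustrative,
# not the authors' implementation). Channel-wise parameters get a damped
# Newton-like update from an extracted Hessian diagonal; all other
# parameters get SGD with momentum.
import torch


def is_channelwise(name, param):
    # Assumption: one-dimensional parameters (e.g. normalization affine
    # scales/shifts and biases) are treated as the channel-wise parameters.
    return param.dim() == 1


def sgd_ph_step(model, loss, buffers, lr=0.1, momentum=0.9, damping=1e-4):
    named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    ch = [p for n, p in named if is_channelwise(n, p)]
    rest = [p for n, p in named if not is_channelwise(n, p)]

    # First-order gradients; keep the graph so a Hessian-vector product
    # can be taken afterwards.
    grads = torch.autograd.grad(loss, ch + rest, create_graph=True)
    g_ch, g_rest = grads[:len(ch)], grads[len(ch):]

    # Hessian-vector product with the all-ones vector. If the Hessian block
    # of each channel-wise parameter is diagonal, this is exactly its diagonal.
    ones = [torch.ones_like(p) for p in ch]
    h_diag = torch.autograd.grad(g_ch, ch, grad_outputs=ones)

    with torch.no_grad():
        # Second-order, channel-wise: damped Newton-like step.
        for p, g, h in zip(ch, g_ch, h_diag):
            p -= lr * g / (h.abs() + damping)
        # First-order, everything else: SGD with momentum.
        for p, g in zip(rest, g_rest):
            buf = buffers.setdefault(id(p), torch.zeros_like(p))
            buf.mul_(momentum).add_(g)
            p -= lr * buf
```

A caller would keep a persistent dictionary of momentum buffers across mini-batches and invoke `sgd_ph_step` once per batch after computing the loss; the `h.abs() + damping` guard is one simple way to keep the Newton-like step well-posed when the extracted curvature is small or negative, and is an assumption of this sketch rather than the paper's exact rule.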
Related papers
- Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods [0.0]
SecondOrderAdaptive (SOAA) is a novel optimization algorithm designed to overcome limitations of traditional second-order techniques.
We empirically demonstrate that SOAA achieves faster and more stable convergence compared to first-order approximations.
arXiv Detail & Related papers (2024-10-03T08:23:06Z) - AdaFisher: Adaptive Second Order Optimization via Fisher Information [22.851200800265914]
We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation of the Fisher information matrix for adaptive gradient preconditioning.
We demonstrate that AdaFisher outperforms state-of-the-art optimizers in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2024-05-26T01:25:02Z) - Information-Theoretic Trust Regions for Stochastic Gradient-Based
Optimization [17.79206971486723]
We approximate the diagonal elements of the Hessian from gradients, constructing a model of the expected Hessian over time using only first-order information (a generic sketch of this idea appears after this list).
We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the generalization capabilities of SGD.
arXiv Detail & Related papers (2023-10-31T16:08:38Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter-free) gradient-based optimizer.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - FOSI: Hybrid First and Second Order Optimization [11.447526245792154]
We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process.
Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
arXiv Detail & Related papers (2023-02-16T18:45:46Z) - Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
arXiv Detail & Related papers (2023-01-30T15:46:39Z) - DAG Learning on the Permutahedron [33.523216907730216]
We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data.
Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering.
arXiv Detail & Related papers (2023-01-27T18:22:25Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic bilevel optimization algorithm named stocBiO, which features a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z) - Multi-View Spectral Clustering with High-Order Optimal Neighborhood
Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base Laplacian matrices.
arXiv Detail & Related papers (2020-08-31T12:28:40Z) - Incorporating Expert Prior in Bayesian Optimisation via Space Warping [54.412024556499254]
In big search spaces, the optimisation algorithm goes through several low-function-value regions before reaching the optimum of the function.
One approach to subside this cold start phase is to use prior knowledge that can accelerate the optimisation.
In this paper, we represent the prior knowledge about the function optimum through a prior distribution.
The prior distribution is then used to warp the search space in such a way that the space expands around the high-probability region of the function optimum and shrinks around the low-probability regions.
arXiv Detail & Related papers (2020-03-27T06:18:49Z)
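As a generic illustration of the idea referenced in the arTuRO entry above (modelling the diagonal of the expected Hessian over time from first-order information only), a common device is an exponential moving average of squared gradients used as a diagonal curvature proxy. The sketch below shows only that device under this assumption; it is not arTuRO's trust-region algorithm, and the class name and `beta` value are invented for the example.

```python
# Illustrative sketch only: a diagonal curvature model maintained from
# gradients alone, via an exponential moving average of squared gradients.
# This is a generic first-order proxy for the expected Hessian diagonal,
# not the arTuRO algorithm itself.
import torch


class DiagonalCurvatureEMA:
    def __init__(self, params, beta=0.99):
        self.beta = beta
        # One running diagonal estimate per parameter tensor.
        self.h_diag = [torch.zeros_like(p) for p in params]

    def update(self, grads):
        # Smooth squared gradients over iterations as a cheap stand-in for
        # the diagonal of the expected Hessian.
        for h, g in zip(self.h_diag, grads):
            h.mul_(self.beta).add_(g * g, alpha=1.0 - self.beta)
        return self.h_diag
```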