Ginger: An Efficient Curvature Approximation with Linear Complexity for
General Neural Networks
- URL: http://arxiv.org/abs/2402.03295v1
- Date: Mon, 5 Feb 2024 18:51:17 GMT
- Title: Ginger: An Efficient Curvature Approximation with Linear Complexity for
General Neural Networks
- Authors: Yongchang Hao, Yanshuai Cao, Lili Mou
- Abstract summary: Second-order optimization approaches like the Gauss-Newton method are considered more powerful as they utilize the curvature information of the objective function.
We propose Ginger, an eigendecomposition for the inverse of the generalized Gauss-Newton matrix.
- Score: 33.96967104979137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second-order optimization approaches like the generalized Gauss-Newton method
are considered more powerful as they utilize the curvature information of the
objective function with preconditioning matrices. Albeit offering tempting
theoretical benefits, they are not easily applicable to modern deep learning.
The major reason is the quadratic memory and cubic time complexity of computing
the inverse of the matrix. These requirements are infeasible even with
state-of-the-art hardware. In this work, we propose Ginger, an
eigendecomposition for the inverse of the generalized Gauss-Newton matrix. Our
method enjoys efficient linear memory and time complexity for each iteration.
Instead of approximating the conditioning matrix, we directly maintain its
inverse to make the approximation more accurate. We provide the convergence
result of Ginger for non-convex objectives. Our experiments on different tasks
with different model architectures verify the effectiveness of our method. Our
code is publicly available.
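To make the complexity claim concrete, here is a minimal sketch (not the exact Ginger update rule) of how a damped rank-k eigendecomposition of a generalized Gauss-Newton-style matrix can be applied to a gradient in linear time and memory; the factors U and s, the rank k, and the damping value are illustrative placeholders.

```python
import numpy as np

def preconditioned_gradient(grad, U, s, damping):
    """Apply (U diag(s) U^T + damping * I)^{-1} to grad in O(d * k).

    Assumes U (d x k) has orthonormal columns and s (k,) is nonnegative, so
    (U S U^T + lam I)^{-1} = (1/lam) I - U diag(s / (lam * (s + lam))) U^T.
    """
    lam = damping
    coeff = s / (lam * (s + lam))              # (k,)
    return grad / lam - U @ (coeff * (U.T @ grad))

# Toy usage with placeholder low-rank curvature factors.
d, k = 10_000, 16
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns
s = rng.uniform(0.1, 5.0, size=k)              # surrogate eigenvalues
grad = rng.normal(size=d)
update = preconditioned_gradient(grad, U, s, damping=1e-2)
```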
Related papers
- KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization [69.47358238222586]
Second-order approximations allow the parameter update step size and direction to adapt to the loss curvature.
Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements.
It takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
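For context, the sketch below shows a Shampoo-style Kronecker-factored preconditioner for a single matrix-shaped parameter, with inverse fourth roots computed by a float64 eigendecomposition, which is where the precision and ill-conditioning concerns arise; the KrADagrad-specific modifications are not shown, and the learning rate and epsilon are placeholder values.

```python
import numpy as np

def inverse_root(mat, power, eps=1e-6):
    """Compute mat^{-1/power} via a symmetric eigendecomposition in float64."""
    w, V = np.linalg.eigh(mat.astype(np.float64))
    w = np.maximum(w, eps)                 # guard against ill-conditioning
    return (V * w ** (-1.0 / power)) @ V.T

def shampoo_step(W, G, L, R, lr=0.01):
    """One Shampoo-style update for a matrix parameter W with gradient G.

    L and R (initialized as small multiples of the identity) accumulate
    G G^T and G^T G; G is preconditioned on both sides by their inverse
    fourth roots.
    """
    L += G @ G.T
    R += G.T @ G
    W -= lr * inverse_root(L, 4) @ G @ inverse_root(R, 4)
    return W, L, R
```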
arXiv Detail & Related papers (2023-05-30T21:15:45Z)
- Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners by the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF)
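For reference, this is roughly what the classical hand-engineered baseline looks like: an incomplete LU factorization used as a preconditioner for the conjugate gradient method in SciPy. A learned factorization such as NeuralIF's would be plugged into the same place; the sparse SPD matrix below is a random placeholder.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, spilu, LinearOperator

# Random sparse symmetric positive-definite test matrix (placeholder).
n = 500
rng = np.random.default_rng(0)
A = sp.random(n, n, density=0.01, random_state=0, format="csc")
A = (A @ A.T + sp.identity(n)).tocsc()
b = rng.normal(size=n)

# Classical hand-engineered preconditioner: incomplete LU factorization.
ilu = spilu(A)
M = LinearOperator((n, n), matvec=ilu.solve)

# Preconditioned conjugate gradient; a learned incomplete factorization
# (as in NeuralIF) would replace M here.
x, info = cg(A, b, M=M, maxiter=200)
```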
arXiv Detail & Related papers (2023-05-25T11:45:46Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
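To make the idea concrete, below is a generic softmax-free (linearized) attention sketch in which softmax(QK^T)V is replaced by a kernel feature map, so the cost scales linearly rather than quadratically in sequence length; it uses the common phi(x) = elu(x) + 1 map as an illustration and is not SOFT's specific construction.

```python
import numpy as np

def feature_map(x):
    """A simple positive feature map, phi(x) = elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Softmax-free attention: phi(Q) (phi(K)^T V), normalized per query.

    Cost is O(n * d^2) instead of the O(n^2 * d) of softmax attention.
    """
    Qf, Kf = feature_map(Q), feature_map(K)        # (n, d)
    kv = Kf.T @ V                                  # (d, d_v)
    z = Qf @ Kf.sum(axis=0)                        # (n,) normalizer
    return (Qf @ kv) / z[:, None]

# Toy usage with random queries/keys/values.
rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)                    # (128, 16)
```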
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized by closed-form solutions.
We further exploit correlations for completion, leading to an adaptive correlation learning model.
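As background, the classical convex baseline that non-convex surrogates aim to improve on is nuclear-norm minimization, commonly solved by iterative singular-value soft-thresholding (soft-impute); a minimal sketch follows, with an arbitrary threshold and iteration count, and without the paper's adaptive correlation learning.

```python
import numpy as np

def soft_impute(X, mask, tau=1.0, iters=50):
    """Matrix completion by iterative singular-value soft-thresholding.

    X: observed matrix (unobserved entries arbitrary); mask: True where observed.
    """
    Z = np.where(mask, X, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_low = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Z = np.where(mask, X, Z_low)                  # keep observed entries
    return Z
```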
arXiv Detail & Related papers (2022-03-04T08:50:50Z)
- Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
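For comparison, the sketch below implements the Newton-Schulz (NS) iteration baseline mentioned above, which computes the matrix square root and inverse square root using only matrix multiplications; the MTP and MPA variants proposed in the paper are not shown, and the input is assumed symmetric positive-definite.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=15):
    """Coupled Newton-Schulz iteration for A^{1/2} and A^{-1/2} (SPD input)."""
    norm = np.linalg.norm(A)                  # Frobenius norm for pre-scaling
    Y = A / norm
    Z = np.eye(A.shape[0])
    for _ in range(num_iters):
        T = 0.5 * (3.0 * np.eye(A.shape[0]) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)   # (A^{1/2}, A^{-1/2})

# Quick check on a random SPD matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
A = B @ B.T + 8 * np.eye(8)
sqrtA, isqrtA = newton_schulz_sqrt(A)
print(np.allclose(sqrtA @ sqrtA, A, atol=1e-4))
```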
arXiv Detail & Related papers (2022-01-29T10:00:35Z)
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
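For orientation, a standard practical iterative algorithm in this setting is binary iterative hard thresholding (BIHT), sketched below for a plain sparsity prior; it is only a baseline illustration, not necessarily the paper's correlation-based algorithm, and the partial Gaussian circulant structure and generative prior are not modeled.

```python
import numpy as np

def biht(y, A, sparsity, step=None, iters=100):
    """Binary iterative hard thresholding for 1-bit measurements y = sign(A x).

    Keeps the `sparsity` largest-magnitude entries after each correlation step.
    """
    m, n = A.shape
    step = step if step is not None else 1.0 / m
    x = np.zeros(n)
    for _ in range(iters):
        x = x + step * A.T @ (y - np.sign(A @ x))   # correlation-style step
        x[np.argsort(np.abs(x))[:-sparsity]] = 0.0  # hard thresholding
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x              # 1-bit data loses the scale

# Toy usage: recover the direction of a sparse vector from sign measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 500, 10
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n))
x_hat = biht(np.sign(A @ x_true), A, sparsity=k)
```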
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
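To illustrate the bottleneck, the sketch below shows the generic implicit-differentiation backward pass for a fixed point z* = f(z*, theta): the vector-Jacobian product requires solving a linear system with (I - df/dz)^T, here done by a fixed-point (Neumann-series) iteration instead of an explicit inverse. This is only the generic setup; SHINE's specific reuse of a quasi-Newton inverse estimate from the forward pass is not reproduced.

```python
import numpy as np

def implicit_vjp(J, grad_z, iters=100):
    """Solve (I - J)^T v = grad_z by fixed-point (Neumann-series) iteration.

    J is df/dz at the fixed point; convergence assumes its spectral radius < 1.
    This avoids forming the explicit inverse of the (potentially huge) matrix.
    """
    v = np.zeros_like(grad_z)
    for _ in range(iters):
        v = grad_z + J.T @ v
    return v

# Toy check against the explicit inverse on a small contractive Jacobian.
rng = np.random.default_rng(0)
n = 50
J = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)    # spectral radius well below 1
grad_z = rng.normal(size=n)
v_iter = implicit_vjp(J, grad_z)
v_exact = np.linalg.solve((np.eye(n) - J).T, grad_z)
print(np.allclose(v_iter, v_exact, atol=1e-8))
```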
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Improved Knowledge Distillation via Full Kernel Matrix Transfer [21.533095275253466]
Knowledge distillation is an effective way for model compression in deep learning.
We decompose the original full matrix with the Nyström method.
Compared with the full matrix, the size of the partial matrix is linear in the number of examples.
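For reference, the Nyström decomposition mentioned above approximates an n x n kernel matrix from an n x m block and an m x m landmark block, so the stored factor grows linearly with the number of examples. A minimal sketch with an RBF kernel and random landmarks follows; the distillation objective itself is not shown.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_factor(X, m=50, gamma=0.5, jitter=1e-8):
    """Return F (n x m) such that K ~= F F^T, using m random landmark points."""
    rng = np.random.default_rng(0)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    C = rbf_kernel(X, landmarks, gamma)              # n x m block
    W = rbf_kernel(landmarks, landmarks, gamma)      # m x m block
    w, V = np.linalg.eigh(W + jitter * np.eye(m))
    W_inv_sqrt = (V / np.sqrt(w)) @ V.T              # W^{-1/2}
    return C @ W_inv_sqrt                            # K ~= C W^{-1} C^T = F F^T

# Toy usage: 2000 points, but only the 2000 x 50 factor is ever stored.
X = np.random.default_rng(1).normal(size=(2000, 5))
F = nystrom_factor(X, m=50)
```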
arXiv Detail & Related papers (2020-09-30T04:03:09Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
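As one concrete instance of combining random projections with numerical linear algebra, the sketch below runs kernel ridge regression with random Fourier features, turning an n x n kernel solve into a D x D one; it is only a generic illustration of the recipe, not the GPU solver developed in the paper.

```python
import numpy as np

def rff_krr(X, y, n_features=500, gamma=0.5, lam=1e-3, seed=0):
    """Kernel ridge regression with random Fourier features for the RBF kernel.

    Maps X to a D-dimensional random feature space where the kernel is
    approximately linear, then solves a D x D ridge system instead of n x n.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)          # n x D features
    alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(n_features), Z.T @ y)
    return lambda Xt: np.sqrt(2.0 / n_features) * np.cos(Xt @ W + b) @ alpha

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=5000)
predict = rff_krr(X, y)
y_hat = predict(X[:100])
```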
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.