Ginger: An Efficient Curvature Approximation with Linear Complexity for
General Neural Networks
- URL: http://arxiv.org/abs/2402.03295v1
- Date: Mon, 5 Feb 2024 18:51:17 GMT
- Title: Ginger: An Efficient Curvature Approximation with Linear Complexity for
General Neural Networks
- Authors: Yongchang Hao, Yanshuai Cao, Lili Mou
- Abstract summary: Second-order optimization approaches like the Gauss-Newton method are considered more powerful as they utilize the curvature information of the objective function.
We propose Ginger, an eigendecomposition for the inverse of the generalized Gauss-Newton matrix.
- Score: 33.96967104979137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second-order optimization approaches like the generalized Gauss-Newton method
are considered more powerful as they utilize the curvature information of the
objective function with preconditioning matrices. Albeit offering tempting
theoretical benefits, they are not easily applicable to modern deep learning.
The major reason is the quadratic memory and cubic time complexity of computing
the inverse of the matrix. These requirements are infeasible even with
state-of-the-art hardware. In this work, we propose Ginger, an
eigendecomposition for the inverse of the generalized Gauss-Newton matrix. Our
method enjoys efficient linear memory and time complexity for each iteration.
Instead of approximating the conditioning matrix, we directly maintain its
inverse to make the approximation more accurate. We provide the convergence
result of Ginger for non-convex objectives. Our experiments on different tasks
with different model architectures verify the effectiveness of our method. Our
code is publicly available.
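To make the complexity claim concrete, here is a minimal sketch (not the exact Ginger update rule) of how a damped rank-k eigendecomposition of a generalized Gauss-Newton-style matrix can be applied to a gradient in linear time and memory; the factors U and s, the rank k, and the damping value are illustrative placeholders.

```python
import numpy as np

def preconditioned_gradient(grad, U, s, damping):
    """Apply (U diag(s) U^T + damping * I)^{-1} to grad in O(d * k).

    Assumes U (d x k) has orthonormal columns and s (k,) is nonnegative, so
    (U S U^T + lam I)^{-1} = (1/lam) I - U diag(s / (lam * (s + lam))) U^T.
    """
    lam = damping
    coeff = s / (lam * (s + lam))              # (k,)
    return grad / lam - U @ (coeff * (U.T @ grad))

# Toy usage with placeholder low-rank curvature factors.
d, k = 10_000, 16
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns
s = rng.uniform(0.1, 5.0, size=k)              # surrogate eigenvalues
grad = rng.normal(size=d)
update = preconditioned_gradient(grad, U, s, damping=1e-2)
```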
Related papers
- KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization [69.47358238222586]
Second-order approximations allow the parameter update step size and direction to adapt to the loss curvature.
Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements.
It takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
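For context, the sketch below shows a Shampoo-style Kronecker-factored preconditioner for a single matrix-shaped parameter, with inverse fourth roots computed by a float64 eigendecomposition, which is where the precision and ill-conditioning concerns arise; the KrADagrad-specific modifications are not shown, and the learning rate and epsilon are placeholder values.

```python
import numpy as np

def inverse_root(mat, power, eps=1e-6):
    """Compute mat^{-1/power} via a symmetric eigendecomposition in float64."""
    w, V = np.linalg.eigh(mat.astype(np.float64))
    w = np.maximum(w, eps)                 # guard against ill-conditioning
    return (V * w ** (-1.0 / power)) @ V.T

def shampoo_step(W, G, L, R, lr=0.01):
    """One Shampoo-style update for a matrix parameter W with gradient G.

    L and R (initialized as small multiples of the identity) accumulate
    G G^T and G^T G; G is preconditioned on both sides by their inverse
    fourth roots.
    """
    L += G @ G.T
    R += G.T @ G
    W -= lr * inverse_root(L, 4) @ G @ inverse_root(R, 4)
    return W, L, R
```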
arXiv Detail & Related papers (2023-05-30T21:15:45Z)
- Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners by the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF)
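For reference, this is roughly what the classical hand-engineered baseline looks like: an incomplete LU factorization used as a preconditioner for the conjugate gradient method in SciPy. A learned factorization such as NeuralIF's would be plugged into the same place; the sparse SPD matrix below is a random placeholder.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, spilu, LinearOperator

# Random sparse symmetric positive-definite test matrix (placeholder).
n = 500
rng = np.random.default_rng(0)
A = sp.random(n, n, density=0.01, random_state=0, format="csc")
A = (A @ A.T + sp.identity(n)).tocsc()
b = rng.normal(size=n)

# Classical hand-engineered preconditioner: incomplete LU factorization.
ilu = spilu(A)
M = LinearOperator((n, n), matvec=ilu.solve)

# Preconditioned conjugate gradient; a learned incomplete factorization
# (as in NeuralIF) would replace M here.
x, info = cg(A, b, M=M, maxiter=200)
```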
arXiv Detail & Related papers (2023-05-25T11:45:46Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
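To make the idea concrete, below is a generic softmax-free (linearized) attention sketch in which softmax(QK^T)V is replaced by a kernel feature map, so the cost scales linearly rather than quadratically in sequence length; it uses the common phi(x) = elu(x) + 1 map as an illustration and is not SOFT's specific construction.

```python
import numpy as np

def feature_map(x):
    """A simple positive feature map, phi(x) = elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Softmax-free attention: phi(Q) (phi(K)^T V), normalized per query.

    Cost is O(n * d^2) instead of the O(n^2 * d) of softmax attention.
    """
    Qf, Kf = feature_map(Q), feature_map(K)        # (n, d)
    kv = Kf.T @ V                                  # (d, d_v)
    z = Qf @ Kf.sum(axis=0)                        # (n,) normalizer
    return (Qf @ kv) / z[:, None]

# Toy usage with random queries/keys/values.
rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)                    # (128, 16)
```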
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized by closed-form solutions.
We further exploit correlations for completion, leading to an adaptive correlation learning model.
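As background, the classical convex baseline that non-convex surrogates aim to improve on is nuclear-norm minimization, commonly solved by iterative singular-value soft-thresholding (soft-impute); a minimal sketch follows, with an arbitrary threshold and iteration count, and without the paper's adaptive correlation learning.

```python
import numpy as np

def soft_impute(X, mask, tau=1.0, iters=50):
    """Matrix completion by iterative singular-value soft-thresholding.

    X: observed matrix (unobserved entries arbitrary); mask: True where observed.
    """
    Z = np.where(mask, X, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_low = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Z = np.where(mask, X, Z_low)                  # keep observed entries
    return Z
```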
arXiv Detail & Related papers (2022-03-04T08:50:50Z)
- Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
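For comparison, the sketch below implements the Newton-Schulz (NS) iteration baseline mentioned above, which computes the matrix square root and inverse square root using only matrix multiplications; the MTP and MPA variants proposed in the paper are not shown, and the input is assumed symmetric positive-definite.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=15):
    """Coupled Newton-Schulz iteration for A^{1/2} and A^{-1/2} (SPD input)."""
    norm = np.linalg.norm(A)                  # Frobenius norm for pre-scaling
    Y = A / norm
    Z = np.eye(A.shape[0])
    for _ in range(num_iters):
        T = 0.5 * (3.0 * np.eye(A.shape[0]) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm), Z / np.sqrt(norm)   # (A^{1/2}, A^{-1/2})

# Quick check on a random SPD matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
A = B @ B.T + 8 * np.eye(8)
sqrtA, isqrtA = newton_schulz_sqrt(A)
print(np.allclose(sqrtA @ sqrtA, A, atol=1e-4))
```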
arXiv Detail & Related papers (2022-01-29T10:00:35Z)
- Robust 1-bit Compressive Sensing with Partial Gaussian Circulant Matrices and Generative Priors [54.936314353063494]
We provide recovery guarantees for a correlation-based optimization algorithm for robust 1-bit compressive sensing.
We make use of a practical iterative algorithm, and perform numerical experiments on image datasets to corroborate our results.
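For orientation, a standard practical iterative algorithm in this setting is binary iterative hard thresholding (BIHT), sketched below for a plain sparsity prior; it is only a baseline illustration, not necessarily the paper's correlation-based algorithm, and the partial Gaussian circulant structure and generative prior are not modeled.

```python
import numpy as np

def biht(y, A, sparsity, step=None, iters=100):
    """Binary iterative hard thresholding for 1-bit measurements y = sign(A x).

    Keeps the `sparsity` largest-magnitude entries after each correlation step.
    """
    m, n = A.shape
    step = step if step is not None else 1.0 / m
    x = np.zeros(n)
    for _ in range(iters):
        x = x + step * A.T @ (y - np.sign(A @ x))   # correlation-style step
        x[np.argsort(np.abs(x))[:-sparsity]] = 0.0  # hard thresholding
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x              # 1-bit data loses the scale

# Toy usage: recover the direction of a sparse vector from sign measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 500, 10
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n))
x_hat = biht(np.sign(A @ x_true), A, sparsity=k)
```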
arXiv Detail & Related papers (2021-08-08T05:28:06Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
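To illustrate the bottleneck, the sketch below shows the generic implicit-differentiation backward pass for a fixed point z* = f(z*, theta): the vector-Jacobian product requires solving a linear system with (I - df/dz)^T, here done by a fixed-point (Neumann-series) iteration instead of an explicit inverse. This is only the generic setup; SHINE's specific reuse of a quasi-Newton inverse estimate from the forward pass is not reproduced.

```python
import numpy as np

def implicit_vjp(J, grad_z, iters=100):
    """Solve (I - J)^T v = grad_z by fixed-point (Neumann-series) iteration.

    J is df/dz at the fixed point; convergence assumes its spectral radius < 1.
    This avoids forming the explicit inverse of the (potentially huge) matrix.
    """
    v = np.zeros_like(grad_z)
    for _ in range(iters):
        v = grad_z + J.T @ v
    return v

# Toy check against the explicit inverse on a small contractive Jacobian.
rng = np.random.default_rng(0)
n = 50
J = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)    # spectral radius well below 1
grad_z = rng.normal(size=n)
v_iter = implicit_vjp(J, grad_z)
v_exact = np.linalg.solve((np.eye(n) - J).T, grad_z)
print(np.allclose(v_iter, v_exact, atol=1e-8))
```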
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Improved Knowledge Distillation via Full Kernel Matrix Transfer [21.533095275253466]
Knowledge distillation is an effective way for model compression in deep learning.
We decompose the original full matrix with the Nyström method.
Compared with the full matrix, the size of the partial matrix is linear in the number of examples.
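For reference, the Nyström decomposition mentioned above approximates an n x n kernel matrix from an n x m block and an m x m landmark block, so the stored factor grows linearly with the number of examples. A minimal sketch with an RBF kernel and random landmarks follows; the distillation objective itself is not shown.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_factor(X, m=50, gamma=0.5, jitter=1e-8):
    """Return F (n x m) such that K ~= F F^T, using m random landmark points."""
    rng = np.random.default_rng(0)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    C = rbf_kernel(X, landmarks, gamma)              # n x m block
    W = rbf_kernel(landmarks, landmarks, gamma)      # m x m block
    w, V = np.linalg.eigh(W + jitter * np.eye(m))
    W_inv_sqrt = (V / np.sqrt(w)) @ V.T              # W^{-1/2}
    return C @ W_inv_sqrt                            # K ~= C W^{-1} C^T = F F^T

# Toy usage: 2000 points, but only the 2000 x 50 factor is ever stored.
X = np.random.default_rng(1).normal(size=(2000, 5))
F = nystrom_factor(X, m=50)
```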
arXiv Detail & Related papers (2020-09-30T04:03:09Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
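As one concrete instance of combining random projections with numerical linear algebra, the sketch below runs kernel ridge regression with random Fourier features, turning an n x n kernel solve into a D x D one; it is only a generic illustration of the recipe, not the GPU solver developed in the paper.

```python
import numpy as np

def rff_krr(X, y, n_features=500, gamma=0.5, lam=1e-3, seed=0):
    """Kernel ridge regression with random Fourier features for the RBF kernel.

    Maps X to a D-dimensional random feature space where the kernel is
    approximately linear, then solves a D x D ridge system instead of n x n.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)          # n x D features
    alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(n_features), Z.T @ y)
    return lambda Xt: np.sqrt(2.0 / n_features) * np.cos(Xt @ W + b) @ alpha

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=5000)
predict = rff_krr(X, y)
y_hat = predict(X[:100])
```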
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.