Practical Quasi-Newton Methods for Training Deep Neural Networks
- URL: http://arxiv.org/abs/2006.08877v3
- Date: Thu, 7 Jan 2021 19:36:34 GMT
- Title: Practical Quasi-Newton Methods for Training Deep Neural Networks
- Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou
- Abstract summary: In training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements.
We approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks.
Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded above and below.
- Score: 12.48022619079224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the development of practical stochastic quasi-Newton, and in
particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for
training deep neural networks (DNNs). In DNN training, the number of variables
and components of the gradient $n$ is often of the order of tens of millions
and the Hessian has $n^2$ elements. Consequently, computing and storing a full
$n \times n$ BFGS approximation or storing a modest number of (step, change in
gradient) vector pairs for use in an L-BFGS implementation is out of the
question. In our proposed methods, we approximate the Hessian by a
block-diagonal matrix and use the structure of the gradient and Hessian to
further approximate these blocks, each of which corresponds to a layer, as the
Kronecker product of two much smaller matrices. This is analogous to the
approach in KFAC, which computes a Kronecker-factored block-diagonal
approximation to the Fisher matrix in a stochastic natural gradient method.
Because of the indefinite and highly variable nature of the Hessian in a DNN, we
also propose a new damping approach to keep the BFGS and L-BFGS approximations
bounded both above and below. In tests on autoencoder
feed-forward neural network models with either nine or thirteen layers applied
to three datasets, our methods outperformed or performed comparably to KFAC and
state-of-the-art first-order stochastic methods.
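To make the per-layer structure concrete, the following minimal NumPy sketch (not the authors' implementation) illustrates the two ingredients described above: applying a Kronecker-factored inverse-curvature approximation to a layer's gradient via the identity $(H_{in} \otimes H_{out})\,\mathrm{vec}(G) = \mathrm{vec}(H_{out} G H_{in})$, and a damped BFGS update of one small factor. The function names, the Powell-style and Levenberg-Marquardt-style damping steps, and the constants mu1 and mu2 are illustrative assumptions; the paper's own damping scheme differs in its details.

```python
import numpy as np

def kron_precondition(grad_W, H_out, H_in):
    """Apply a Kronecker-factored inverse-curvature approximation to a
    layer gradient.  For an m x n weight matrix, (H_in kron H_out) vec(G)
    equals vec(H_out @ G @ H_in), so the m*n x m*n block is never formed."""
    return H_out @ grad_W @ H_in                  # (m x m) @ (m x n) @ (n x n)

def damped_bfgs_update(H, s, y, mu1=0.2, mu2=1e-3):
    """Inverse-BFGS update of one small factor H from a (step, change in
    gradient) pair, damped so that H stays positive definite and bounded.
    The damping here (Powell-style damping of s against H, followed by a
    Levenberg-Marquardt-style shift of y) is an illustrative stand-in for
    the paper's damping scheme; mu1 and mu2 are assumed constants."""
    Hy = H @ y
    yHy = y @ Hy
    if s @ y < mu1 * yHy:                         # curvature condition too weak
        theta = (1.0 - mu1) * yHy / (yHy - s @ y)
        s = theta * s + (1.0 - theta) * Hy        # pull s toward H @ y
    y = y + mu2 * s                               # secant pair of the shifted Hessian B + mu2*I
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)     # standard inverse-BFGS formula
```

A per-layer step would then look like dW = -lr * kron_precondition(grad_W, H_out, H_in), with H_out and H_in each maintained by damped_bfgs_update from their own (s, y) pairs; an L-BFGS variant would instead store a short history of damped pairs per factor.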
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space for modeling functions represented by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are trained by solving non-convex optimization problems.
In this paper, we examine the use of convex neural network recovery models.
We show that the stationary points of the non-convex training objective can be characterized as global optima of subsampled convex programs.
arXiv Detail & Related papers (2023-12-19T23:04:56Z)
- Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently performs global gradient approximation while achieving better accuracy and generalization ability in local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z)
- Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum [18.10812063219831]
We introduce Modified Spectrum Kernels (MSKs) to approximate kernels with desired eigenvalues.
We propose a preconditioned gradient descent method, which alters the trajectory of gradient descent.
Our method is both computationally efficient and simple to implement.
arXiv Detail & Related papers (2023-07-26T22:39:47Z)
- Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems [64.29491112653905]
We propose a novel and efficient diffusion sampling strategy that synergistically combines the diffusion sampling and Krylov subspace methods.
Specifically, we prove that if the tangent space at a sample denoised by Tweedie's formula forms a Krylov subspace, then conjugate gradients (CG) initialized with the denoised data keeps the data-consistency update within that tangent space.
Our proposed method achieves an inference time more than 80 times faster than the previous state-of-the-art method.
arXiv Detail & Related papers (2023-03-10T07:42:49Z)
- A Mini-Block Natural Gradient Method for Deep Neural Networks [12.48022619079224]
We propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF).
Our novel approach exploits parallelism to efficiently perform computations on the large number of matrices in each layer.
arXiv Detail & Related papers (2022-02-08T20:01:48Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- Research of Damped Newton Stochastic Gradient Descent Method for Neural Network Training [6.231508838034926]
First-order methods such as stochastic gradient descent (SGD) are currently the most popular methods for training deep neural networks (DNNs).
In this paper, we propose the Damped Newton Descent (DN-SGD) and Gradient Descent Damped Newton (SGD-DN) methods to train DNNs for regression problems with Mean Square Error (MSE) and classification problems with Cross-Entropy Loss (CEL).
Our methods apply accurate second-order updates to only a small part of the parameters, which greatly reduces the computational cost and makes learning faster and more accurate than SGD.
arXiv Detail & Related papers (2021-03-31T02:07:18Z)
- Kronecker-factored Quasi-Newton Methods for Convolutional Neural Networks [10.175972095073282]
KF-QN-CNN is a new Kronecker-factored quasi-Newton method for training convolutional neural networks (CNNs).
KF-QN-CNN consistently exhibited superior performance in all of our tests.
arXiv Detail & Related papers (2021-02-12T19:40:34Z)
- A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z)
- Improving predictions of Bayesian neural nets via local linearization [79.21517734364093]
We argue that the Gauss-Newton approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN).
Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one.
We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation.
arXiv Detail & Related papers (2020-08-19T12:35:55Z)