A Mini-Block Natural Gradient Method for Deep Neural Networks
- URL: http://arxiv.org/abs/2202.04124v1
- Date: Tue, 8 Feb 2022 20:01:48 GMT
- Title: A Mini-Block Natural Gradient Method for Deep Neural Networks
- Authors: Achraf Bahamou, Donald Goldfarb, Yi Ren
- Abstract summary: We propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF).
Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer.
- Score: 12.48022619079224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The training of deep neural networks (DNNs) is currently predominantly done
using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and
RMSprop, and their variants) incorporate a small amount of curvature
information by using a diagonal matrix to precondition the stochastic gradient.
Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and
TNT, have been developed for training DNNs, by preconditioning the stochastic
gradient by layer-wise block-diagonal matrices. Here we propose and analyze the
convergence of an approximate natural gradient method, mini-block Fisher (MBF),
that lies in between these two classes of methods. Specifically, our method
uses a block-diagonal approximation to the Fisher matrix, where for each layer
in the DNN, whether it is convolutional or feed-forward and fully connected,
the associated diagonal block is also block-diagonal and is composed of a large
number of mini-blocks of modest size. Our novel approach utilizes the
parallelism of GPUs to efficiently perform computations on the large number of
matrices in each layer. Consequently, MBF's per-iteration computational cost is
only slightly higher than it is for first-order methods. Finally, the
performance of our proposed method is compared to that of several baseline
methods, on both Auto-encoder and CNN problems, to validate its effectiveness
both in terms of time efficiency and generalization power.
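As a rough illustration of the structure described above, the sketch below shows how the gradient of a single layer, already reshaped into many small mini-blocks, could be preconditioned with damped mini-block Fisher estimates using batched GPU operations. The block shapes, damping value, and moving-average decay are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of mini-block Fisher (MBF)-style preconditioning for one layer,
# assuming the layer's gradient has been reshaped into `num_blocks` mini-blocks
# of size `block_dim` (the mapping from a real conv/FC layer to mini-blocks is
# defined in the paper and not reproduced here).
import torch

def mbf_precondition(grad_blocks, fisher_blocks, damping=1e-3):
    """Apply (F_i + damping*I)^{-1} g_i to every mini-block in one batched solve.

    grad_blocks:   (num_blocks, block_dim)             per-mini-block gradients
    fisher_blocks: (num_blocks, block_dim, block_dim)  running mini-block Fisher estimates
    """
    num_blocks, block_dim = grad_blocks.shape
    eye = torch.eye(block_dim, device=grad_blocks.device).expand(num_blocks, -1, -1)
    damped = fisher_blocks + damping * eye
    # One batched linear solve: GPUs handle many small systems in parallel,
    # which is why the per-iteration cost stays close to a first-order method.
    return torch.linalg.solve(damped, grad_blocks.unsqueeze(-1)).squeeze(-1)

def update_fisher_blocks(fisher_blocks, grad_blocks, decay=0.95):
    """Exponential moving average of per-mini-block outer products g_i g_i^T."""
    outer = grad_blocks.unsqueeze(-1) @ grad_blocks.unsqueeze(-2)
    return decay * fisher_blocks + (1.0 - decay) * outer
```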
Related papers
- Low-rank extended Kalman filtering for online learning of neural
networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
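For intuition only, the snippet below shows why a low-rank plus diagonal representation is attractive: a matrix stored as diag(d) + W W^T can be solved against with the Woodbury identity in time linear in the dimension. The function name and the choice of which posterior quantity is factored this way are assumptions, not the paper's code.

```python
# Illustrative Woodbury solve for a low-rank-plus-diagonal matrix; not the paper's
# EKF implementation, just the structural trick that makes it cheap.
import numpy as np

def lowrank_plus_diag_solve(d, W, b):
    """Solve (diag(d) + W W^T) x = b without forming the full matrix.

    d: (n,) positive diagonal,  W: (n, r) low-rank factor,  b: (n,) right-hand side.
    Cost is O(n r^2 + r^3) instead of O(n^3).
    """
    Dinv_b = b / d
    Dinv_W = W / d[:, None]
    small = np.eye(W.shape[1]) + W.T @ Dinv_W          # (r, r) capacitance matrix
    correction = Dinv_W @ np.linalg.solve(small, W.T @ Dinv_b)
    return Dinv_b - correction
```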
arXiv Detail & Related papers (2023-05-31T03:48:49Z) - Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for
Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW, when combined with the proposed per-layer step-sizes, are able to produce effective LR schedules.
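The core idea lends itself to a tiny sketch: assign each layer its own step-size by dividing a base learning rate by a per-layer curvature estimate. The proxy used below (mean squared gradient per parameter tensor) is only a stand-in assumption for the diagonal Hessian-block information the paper actually exploits.

```python
# Hedged sketch of per-layer adaptive step-sizes; the curvature proxy is an
# assumption, not the paper's rule.
import torch

def per_layer_lrs(model, base_lr=1e-3, eps=1e-8):
    """Return a dict mapping parameter name -> adaptive step-size."""
    lrs = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        # Crude per-tensor curvature stand-in; each parameter tensor plays the
        # role of a "layer" here.
        curvature_proxy = p.grad.pow(2).mean()
        lrs[name] = base_lr / (curvature_proxy.sqrt() + eps)
    return lrs
```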
arXiv Detail & Related papers (2023-05-23T04:12:55Z) - High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
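The combination the summary describes can be illustrated generically: a stochastic diagonal estimator (in the style of Bekas et al.) paired with conjugate-gradient solves gives access to diag(A^{-1}) without ever forming A^{-1}. The probe count and the use of SciPy's cg are illustrative assumptions.

```python
# Matrix-free estimate of diag(A^{-1}) via Rademacher probes and CG solves.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def estimate_inverse_diagonal(matvec, n, num_probes=30, seed=0):
    """matvec(v) must return A @ v for a symmetric positive definite A of size n."""
    rng = np.random.default_rng(seed)
    A = LinearOperator((n, n), matvec=matvec)
    num = np.zeros(n)
    den = np.zeros(n)
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        x, info = cg(A, v)                    # x ~= A^{-1} v, never forming A^{-1}
        num += v * x
        den += v * v
    return num / den
```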
arXiv Detail & Related papers (2022-02-25T16:35:26Z) - Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root.
For the forward propagation, one method uses a Matrix Taylor Polynomial (MTP) and the other uses Matrix Padé Approximants (MPA).
A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
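As a rough sketch of the truncated-series idea (the paper's MTP/MPA constructions are more careful than this), one can normalize an SPD matrix, expand (I - Y)^{1/2} to a few terms, and rescale; every operation is a matrix product, so the result is differentiable. The normalization choice and polynomial degree are assumptions.

```python
# Truncated Taylor-series matrix square root; illustrative, not the paper's code.
import torch

def taylor_matrix_sqrt(A, degree=8):
    """Approximate the square root of an SPD matrix A with a truncated Taylor polynomial."""
    n = A.shape[-1]
    norm = torch.linalg.matrix_norm(A, ord='fro')
    I = torch.eye(n, dtype=A.dtype, device=A.device)
    Y = I - A / norm                      # eigenvalues of Y lie in [0, 1), so the series converges
    S = I.clone()
    term = I.clone()
    coeff = 1.0
    for k in range(1, degree + 1):
        term = term @ Y                   # Y^k
        coeff = coeff * (0.5 - (k - 1)) / k   # binomial(1/2, k), built recursively
        S = S + coeff * (-1) ** k * term
    return S * norm.sqrt()                # undo the normalization
```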
arXiv Detail & Related papers (2022-01-29T10:00:35Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
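Setting the GDPA-specific machinery aside, the general unfolding recipe the summary describes looks like this in miniature: each iteration of a simple model-based solver becomes a neural layer with its own learnable parameters, so a K-step solver can be trained end to end. The toy solver (gradient steps on a linear system) is purely an assumption for illustration.

```python
# Generic algorithm unfolding: K iterations of a model-based solver as K neural layers.
import torch
import torch.nn as nn

class UnrolledGradientStep(nn.Module):
    """One iteration x <- x - alpha * (A x - b), with alpha learned per layer."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.1))

    def forward(self, x, A, b):
        return x - self.alpha * (A @ x - b)

class UnrolledSolver(nn.Module):
    def __init__(self, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList(UnrolledGradientStep() for _ in range(num_layers))

    def forward(self, A, b):
        x = torch.zeros_like(b)
        for layer in self.layers:
            x = layer(x, A, b)
        return x
```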
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Kronecker-factored Quasi-Newton Methods for Convolutional Neural
Networks [10.175972095073282]
KF-QN-CNN is a new Kronecker-factored quasi-Newton method for training convolutional neural networks (CNNs).
KF-QN-CNN consistently exhibited superior performance in all of our tests.
arXiv Detail & Related papers (2021-02-12T19:40:34Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC)
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Practical Quasi-Newton Methods for Training Deep Neural Networks [12.48022619079224]
In training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements.
We approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks.
Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded, both above and below.
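To see what damping buys in this setting, the snippet below shows classical Powell damping of the curvature pair (s, y) before a standard BFGS update; blending y with B s keeps s^T y bounded away from zero, which preserves positive definiteness. The paper's own damping scheme for (L-)BFGS in DNNs differs in detail, so treat this purely as an illustration.

```python
# Classical Powell-damped BFGS update; shown for intuition, not the paper's scheme.
import numpy as np

def powell_damped_y(s, y, B, mu=0.2):
    """Blend y with B s so that s^T y_tilde >= mu * s^T B s (Powell damping)."""
    sBs = s @ B @ s
    sy = s @ y
    if sy >= mu * sBs:
        return y
    theta = (1.0 - mu) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)

def damped_bfgs_update(B, s, y, mu=0.2):
    """Standard BFGS update of the Hessian approximation B using the damped pair."""
    y = powell_damped_y(s, y, B, mu)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
```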
arXiv Detail & Related papers (2020-06-16T02:27:12Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of the method and confirm the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality
Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
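The ingredients named in the summary can be pictured with a toy factored layer: the weight is kept as U diag(s) V^T throughout training, an orthogonality penalty keeps U and V close to having orthonormal columns, and an L1 penalty on s pushes singular values toward zero so the effective rank drops. Layer sizes, penalty weights, and the exact losses are assumptions rather than the paper's recipe.

```python
# Toy factored linear layer with orthogonality and singular-value sparsity penalties.
import torch
import torch.nn as nn

class FactoredLinear(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) / rank ** 0.5)
        self.s = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(in_features, rank) / rank ** 0.5)

    def forward(self, x):
        # Weight is implicitly U diag(s) V^T; never materialized.
        return ((x @ self.V) * self.s) @ self.U.t()

    def regularizer(self, ortho_weight=1e-2, sparse_weight=1e-3):
        eye = torch.eye(self.U.shape[1], device=self.U.device)
        ortho = ((self.U.t() @ self.U - eye) ** 2).sum() \
              + ((self.V.t() @ self.V - eye) ** 2).sum()
        return ortho_weight * ortho + sparse_weight * self.s.abs().sum()
```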
arXiv Detail & Related papers (2020-04-20T02:40:43Z)