Tensor Normal Training for Deep Learning Models
- URL: http://arxiv.org/abs/2106.02925v1
- Date: Sat, 5 Jun 2021 15:57:22 GMT
- Title: Tensor Normal Training for Deep Learning Models
- Authors: Yi Ren, Donald Goldfarb
- Abstract summary: We propose and analyze a brand new approximate natural gradient method, Tensor Normal Training (TNT).
In our experiments, TNT exhibited superior optimization performance to first-order methods.
- Score: 10.175972095073282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the predominant use of first-order methods for training deep learning
models, second-order methods, and in particular, natural gradient methods,
remain of interest because of their potential for accelerating training through
the use of curvature information. Several methods with non-diagonal
preconditioning matrices, including KFAC and Shampoo, have been proposed and
shown to be effective. Based on the so-called tensor normal (TN) distribution,
we propose and analyze a brand new approximate natural gradient method, Tensor
Normal Training (TNT), which, like Shampoo, only requires knowledge of the shape
of the training parameters. By approximating the probabilistically based Fisher
matrix, as opposed to the empirical Fisher matrix, our method uses the
layer-wise covariance of the sampling-based gradient as the preconditioning
matrix. Moreover, the assumption that the sampling-based (tensor) gradient
follows a TN distribution, ensures that its covariance has a Kronecker
separable structure, which leads to a tractable approximation to the Fisher
matrix. Consequently, TNT's memory requirements and per-iteration computational
costs are only slightly higher than those for first-order methods. In our
experiments, TNT exhibited superior optimization performance to KFAC and
Shampoo, and to state-of-the-art first-order methods. Moreover, TNT
demonstrated its ability to generalize as well as these first-order methods,
using fewer epochs.
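To make the Kronecker-separable idea concrete, here is a minimal sketch (plain NumPy; the names kron_factored_ng_step, L_cov, and R_cov are illustrative, not taken from the paper) of how the natural-gradient step theta <- theta - lr * F^{-1} * grad reduces to two small matrix inverses per layer once the Fisher matrix F is approximated by a Kronecker product R ⊗ L, as in TNT, KFAC, and Shampoo.
```python
# Illustrative sketch only (not the authors' implementation): one
# Kronecker-factored natural-gradient update for a fully connected
# layer with weight matrix W of shape (m, n).
import numpy as np

def kron_factored_ng_step(W, G, L_cov, R_cov, lr=1e-2, damping=1e-3):
    """Return the updated weights after one preconditioned step.

    W      : (m, n) weight matrix of the layer
    G      : (m, n) sampling-based gradient for this layer
    L_cov  : (m, m) symmetric left (row) covariance factor
    R_cov  : (n, n) symmetric right (column) covariance factor
    """
    m, n = W.shape
    # Damped inverses of the two small factors: invert an (m x m) and an
    # (n x n) matrix instead of the full (mn x mn) Fisher matrix.
    L_inv = np.linalg.inv(L_cov + damping * np.eye(m))
    R_inv = np.linalg.inv(R_cov + damping * np.eye(n))
    # For symmetric factors, (R ⊗ L)^{-1} vec(G) = vec(L^{-1} G R^{-1}),
    # so the preconditioned gradient never requires forming the Kronecker
    # product explicitly.
    precond_G = L_inv @ G @ R_inv
    return W - lr * precond_G
```
In TNT, L_cov and R_cov would be running (exponential-moving-average) estimates of the layer-wise row and column covariances of the sampling-based gradient; the sketch simply takes them as given.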
Related papers
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
FNGD exhibits similarities to the average sum in first-order methods, leading to the computational complexity of FNGD being comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are widely used in machine learning.
In this paper, we examine convex reformulations of neural network training based on Lasso models.
We show that all stationary points of the non-convex training objective can be characterized as global optima of a subsampled convex program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - Low-rank extended Kalman filtering for online learning of neural
networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z) - Geometry-aware training of factorized layers in tensor Tucker format [6.701651480567394]
We introduce a novel approach to train the factors of a Tucker decomposition of the weight tensors.
Our training proposal proves to be optimal in locally approximating the original unfactorized dynamics.
We provide a theoretical analysis of the algorithm, showing convergence, approximation and local descent guarantees.
arXiv Detail & Related papers (2023-05-30T14:20:51Z) - Natural Gradient Methods: Perspectives, Efficient-Scalable
Approximations, and Analysis [0.0]
Natural Gradient Descent is a second-order optimization method motivated by information geometry.
It makes use of the Fisher Information Matrix instead of the Hessian which is typically used.
Being a second-order method, however, makes it infeasible to use directly in problems with a huge number of parameters and large amounts of data.
arXiv Detail & Related papers (2023-03-06T04:03:56Z) - A Mini-Block Natural Gradient Method for Deep Neural Networks [12.48022619079224]
We propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF)
Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer.
arXiv Detail & Related papers (2022-02-08T20:01:48Z) - Efficient Approximations of the Fisher Matrix in Neural Networks using
Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements to KFAC from the standpoint of accuracy are proposed.
arXiv Detail & Related papers (2022-01-25T12:56:17Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent.
In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix.
One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
arXiv Detail & Related papers (2020-11-01T17:54:21Z) - Reintroducing Straight-Through Estimators as Principled Methods for
Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches.
At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights (a minimal sketch of the straight-through trick appears after this list).
arXiv Detail & Related papers (2020-06-11T23:58:18Z) - Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs [71.26657499537366]
We propose a simple interpolation-based method for the efficient approximation of gradients in neural ODE models.
We compare it with the reverse dynamic method to train neural ODEs on classification, density estimation, and inference approximation tasks.
arXiv Detail & Related papers (2020-03-11T13:15:57Z)
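As noted in the straight-through entry above, here is a minimal hand-written sketch of the ST trick (plain NumPy; binary_forward and straight_through_grad are hypothetical names, and this illustrates the general idea rather than the derivation in that paper): binarize the weights in the forward pass, and treat the binarization as a (clipped) identity in the backward pass.
```python
# Illustrative sketch of the straight-through (ST) estimator for a
# binary weight applied elementwise to an input. All arguments are
# NumPy arrays of the same shape.
import numpy as np

def binary_forward(w, x):
    """Forward pass: binarize the weights, then apply them elementwise."""
    w_bin = np.sign(w)                      # non-differentiable step
    return w_bin * x

def straight_through_grad(w, x, grad_out):
    """Backward pass for w using the straight-through estimator."""
    # The true derivative of sign(w) is zero almost everywhere, so the
    # ST estimator replaces it with 1, usually masked to |w| <= 1
    # (the "hard tanh" surrogate), and passes the gradient through.
    surrogate = (np.abs(w) <= 1.0).astype(w.dtype)
    return grad_out * x * surrogate
```
A typical training loop would apply this surrogate gradient to the real-valued weights w while always using sign(w) in the forward pass.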