An iterative K-FAC algorithm for Deep Learning
- URL: http://arxiv.org/abs/2101.00218v1
- Date: Fri, 1 Jan 2021 12:04:01 GMT
- Title: An iterative K-FAC algorithm for Deep Learning
- Authors: Yingshi Chen
- Abstract summary: The key of K-FAC is to approximate the Fisher information matrix (FIM) as a block-diagonal matrix where each block is the inverse of a Kronecker product of small factors.
In this short note, we present CG-FAC, a new iterative K-FAC algorithm.
We prove that the time and memory complexity of the iterative CG-FAC is much lower than that of the standard K-FAC algorithm.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Kronecker-factored Approximate Curvature (K-FAC) method is a
highly efficient second-order optimizer for deep learning. Its training time is
lower than that of SGD (or other first-order methods) at the same accuracy in
many large-scale problems. The key of K-FAC is to approximate the Fisher
information matrix (FIM) as a block-diagonal matrix where each block is the
inverse of a Kronecker product of small factors. In this short note, we present
CG-FAC, a new iterative K-FAC algorithm. It uses the conjugate gradient method
to approximate the natural gradient. CG-FAC is matrix-free: it needs to form
neither the FIM itself nor the Kronecker factors A and G. We prove that the
time and memory complexity of the iterative CG-FAC is much lower than that of
the standard K-FAC algorithm.
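The matrix-free idea can be made concrete with a short sketch. The snippet below is a minimal illustration under my own assumptions, not the paper's implementation: it approximates the natural gradient by running conjugate gradient on (F + damping*I) v = g, where the Fisher-vector product `fvp` is a hypothetical black-box callable (e.g. built from per-example gradients), so neither the FIM nor the factors A and G are ever formed.

```python
import numpy as np

def conjugate_gradient(fvp, g, damping=1e-3, max_iters=50, tol=1e-10):
    """Solve (F + damping*I) v = g using only Fisher-vector products.

    fvp: callable returning F @ v for a vector v (matrix-free access to the FIM).
    g:   flattened gradient vector.
    Returns an approximation to the natural gradient (F + damping*I)^{-1} g.
    """
    v = np.zeros_like(g)
    r = g.copy()                      # residual for the initial guess v = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Fp = fvp(p) + damping * p     # matrix-vector product; F is never materialized
        alpha = rs_old / (p @ Fp)
        v += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v

# Toy usage with a synthetic SPD "Fisher" exposed only through matvecs.
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 100))
F = M @ M.T / 100.0                   # stands in for an (unseen) FIM
g = rng.normal(size=100)
nat_grad = conjugate_gradient(lambda v: F @ v, g)
```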
Related papers
- Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC [26.275682325827706]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning Kronecker factors are dense.
We formulate an inverse-free KFAC update and impose structures on the Kronecker factors, resulting in structured inverse-free natural gradient descent.
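As a hedged illustration of why structure in the Kronecker factors saves memory and removes explicit inversion, the sketch below (an assumption made for exposition, not that paper's actual update) keeps only diagonal Kronecker factors, so the preconditioning step reduces to elementwise division.

```python
import numpy as np

def structured_kfac_precondition(grad_W, a_diag, g_diag, damping=1e-3):
    """Precondition a layer gradient with diagonal Kronecker factors.

    grad_W: (out_dim, in_dim) gradient of a linear layer.
    a_diag: (in_dim,)  diagonal of the input-covariance factor A.
    g_diag: (out_dim,) diagonal of the output-gradient factor G.
    With diagonal factors, (G + dI)^{-1} grad_W (A + dI)^{-1} is an
    elementwise division, so no dense matrix is stored or inverted.
    """
    return grad_W / np.outer(g_diag + damping, a_diag + damping)

# Hypothetical usage with running diagonal statistics.
out_dim, in_dim = 64, 128
grad_W = np.random.randn(out_dim, in_dim)
a_diag = np.random.rand(in_dim) + 0.1     # e.g. EMA of activation second moments
g_diag = np.random.rand(out_dim) + 0.1    # e.g. EMA of output-gradient second moments
update = structured_kfac_precondition(grad_W, a_diag, g_diag)
```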
arXiv Detail & Related papers (2023-12-09T23:13:32Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow for inexact computation of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates [0.0]
We exploit the exponential-average construction paradigm of the K-factors, and use online numerical linear algebra techniques.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure which scales linearly as well.
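To make the online-update idea concrete, here is a small sketch under my own assumptions (a generic rank-1 Sherman-Morrison refresh, not necessarily the procedure proposed in that paper): when a K-factor is maintained as an exponential average with one rank-1 term per step, its inverse can be updated in place instead of being recomputed from scratch.

```python
import numpy as np

def ema_kfactor_inverse_update(A_inv, a_bar, beta=0.95):
    """Rank-1 online update of a K-factor inverse.

    The factor evolves as A_new = beta * A + (1 - beta) * a_bar a_bar^T
    (an exponential average of activation statistics). Writing
    A_new = beta * (A + c * a_bar a_bar^T) with c = (1 - beta) / beta,
    Sherman-Morrison yields A_new^{-1} without a fresh inversion.
    """
    c = (1.0 - beta) / beta
    Ainv_a = A_inv @ a_bar
    denom = 1.0 + c * (a_bar @ Ainv_a)
    inner_inv = A_inv - c * np.outer(Ainv_a, Ainv_a) / denom   # (A + c a a^T)^{-1}
    return inner_inv / beta                                     # A_new^{-1}

# Check against a direct inverse on a small example.
d = 50
M = np.random.randn(d, d)
A = M @ M.T + np.eye(d)
a_bar = np.random.randn(d)
A_new = 0.95 * A + 0.05 * np.outer(a_bar, a_bar)
print(np.allclose(ema_kfactor_inverse_update(np.linalg.inv(A), a_bar),
                  np.linalg.inv(A_new)))
```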
arXiv Detail & Related papers (2022-10-16T09:41:23Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximate gradient descent method based on a recent practical envelope-smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked-range losses, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
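The matrix-free flavor of that approach can be illustrated with the classical stochastic diagonal estimator from numerical linear algebra (a generic sketch under my own assumptions, not the paper's exact scheme): the diagonal of a matrix is recovered from matrix-vector products with random probe vectors, so the matrix itself is never constructed.

```python
import numpy as np

def estimate_diagonal(matvec, dim, num_probes=200, rng=None):
    """Estimate diag(A) using only products v -> A @ v.

    Uses Rademacher probes z: E[z * (A @ z)] = diag(A) elementwise,
    so averaging over probes recovers the diagonal without forming A.
    """
    rng = np.random.default_rng(rng)
    acc = np.zeros(dim)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        acc += z * matvec(z)
    return acc / num_probes

# Toy check: the hidden matrix is only accessed through matvecs.
dim = 30
M = np.random.randn(dim, dim)
A = M @ M.T
approx = estimate_diagonal(lambda v: A @ v, dim)
print(np.max(np.abs(approx - np.diag(A))))   # shrinks as num_probes grows
```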
arXiv Detail & Related papers (2022-02-25T16:35:26Z)
- Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed.
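For context, the Kronecker-factored block-diagonal approximation that KFAC builds on can be written down in a few lines. The sketch below illustrates the standard identity (A kron G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}) for one fully connected layer; the factor matrices are synthetic placeholders rather than statistics from a real network, and the improvements proposed in that paper are not reproduced here.

```python
import numpy as np

def kfac_layer_natural_gradient(grad_W, A, G, damping=1e-3):
    """Apply the Kronecker-factored inverse-FIM block of one layer.

    A: (in_dim, in_dim)   covariance of layer inputs (activations).
    G: (out_dim, out_dim) covariance of output gradients.
    The FIM block is approximated as A kron G, and its inverse acts on the
    gradient as G^{-1} @ grad_W @ A^{-1}, so only the small factors are inverted.
    """
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    return np.linalg.solve(G_damped, grad_W) @ np.linalg.inv(A_damped)

# Synthetic factors standing in for accumulated K-FAC statistics.
in_dim, out_dim = 32, 16
X, Y = np.random.randn(in_dim, in_dim), np.random.randn(out_dim, out_dim)
A, G = X @ X.T / in_dim, Y @ Y.T / out_dim
grad_W = np.random.randn(out_dim, in_dim)
nat_grad_W = kfac_layer_natural_gradient(grad_W, A, G)
```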
arXiv Detail & Related papers (2022-01-25T12:56:17Z)
- Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures [49.73889315176884]
Conditional gradient methods (CGM) are widely used in modern machine learning.
Most efforts focus on reducing the number of iterations as a means to reduce the overall running time.
We give the first algorithms for many fundamental optimization methods whose cost per iteration is sublinear in the number of parameters.
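The connection to maximum inner product (MaxIP) data structures can be made concrete with a small sketch. The following is my own illustrative example, not that paper's algorithm: in a Frank-Wolfe (conditional gradient) step over the convex hull of a fixed set of atoms, the linear minimization oracle is exactly an inner-product maximization over the atoms, which is the query an approximate MaxIP structure could answer in sublinear time.

```python
import numpy as np

def frank_wolfe_step(x, gradient, atoms, step):
    """One conditional gradient step over conv(atoms).

    atoms: (num_atoms, dim) array of vertices of the feasible set.
    The linear minimization oracle argmin_s <gradient, s> equals
    argmax_s <-gradient, s>: a maximum inner product query, which is what a
    MaxIP data structure would accelerate instead of this exhaustive scan.
    """
    scores = atoms @ (-gradient)          # brute-force MaxIP query, O(num_atoms * dim)
    s = atoms[np.argmax(scores)]
    return x + step * (s - x)             # move toward the chosen vertex

# Toy usage: minimize 0.5*||x - target||^2 over the convex hull of random atoms.
dim, num_atoms = 20, 500
atoms = np.random.randn(num_atoms, dim)
target = atoms.mean(axis=0)               # lies inside the hull by construction
x = atoms[0].copy()
for t in range(1, 200):
    grad = x - target
    x = frank_wolfe_step(x, grad, atoms, step=2.0 / (t + 2))
```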
arXiv Detail & Related papers (2021-11-30T05:40:14Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z)
- Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent.
In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix.
One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
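To contrast the two families mentioned above, here is a hedged sketch (a generic textbook Adagrad update, not taken from that paper) of the diagonal style of empirical-Fisher preconditioning; K-FAC replaces this per-coordinate scaling with per-layer Kronecker-factored blocks like the one sketched earlier.

```python
import numpy as np

def adagrad_update(params, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad step: precondition by a diagonal built from squared gradients.

    accum keeps a running sum of squared gradients; dividing by its square root
    acts as a diagonal approximation of curvature / the empirical Fisher.
    """
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

# Toy usage on a quadratic objective f(x) = 0.5 * x^T D x.
dim = 10
D = np.linspace(1.0, 100.0, dim)       # ill-conditioned diagonal curvature
x = np.ones(dim)
accum = np.zeros(dim)
for _ in range(500):
    grad = D * x
    x, accum = adagrad_update(x, grad, accum)
```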
arXiv Detail & Related papers (2020-11-01T17:54:21Z)
- Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes [53.72166325215299]
We study minimax-optimal reinforcement learning in episodic factored Markov decision processes (FMDPs).
The first of our two algorithms achieves minimax optimal regret guarantees for a rich class of factored structures.
The second enjoys better computational complexity at the cost of slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.