An iterative K-FAC algorithm for Deep Learning
- URL: http://arxiv.org/abs/2101.00218v1
- Date: Fri, 1 Jan 2021 12:04:01 GMT
- Title: An iterative K-FAC algorithm for Deep Learning
- Authors: Yingshi Chen
- Abstract summary: The key of K-FAC is to approximate the Fisher information matrix (FIM) as a block-diagonal matrix where each block is the inverse of a Kronecker product of small factors.
In this short note, we present CG-FAC, a new iterative K-FAC algorithm.
We prove that the time and memory complexity of the iterative CG-FAC is much lower than that of the standard K-FAC algorithm.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Kronecker-factored Approximate Curvature (K-FAC) method is a
highly efficient second-order optimizer for deep learning. Its training time is
lower than that of SGD (or other first-order methods) at the same accuracy in
many large-scale problems. The key of K-FAC is to approximate the Fisher
information matrix (FIM) as a block-diagonal matrix where each block is the
inverse of a Kronecker product of small factors. In this short note, we present
CG-FAC, a new iterative K-FAC algorithm. It uses the conjugate gradient method
to approximate the natural gradient. CG-FAC is matrix-free: it needs to form
neither the FIM itself nor the Kronecker factors A and G. We prove that the
time and memory complexity of the iterative CG-FAC is much lower than that of
the standard K-FAC algorithm.
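The matrix-free idea can be made concrete with a short sketch. The snippet below is a minimal illustration under my own assumptions, not the paper's implementation: it approximates the natural gradient by running conjugate gradient on (F + damping*I) v = g, where the Fisher-vector product `fvp` is a hypothetical black-box callable (e.g. built from per-example gradients), so neither the FIM nor the factors A and G are ever formed.

```python
import numpy as np

def conjugate_gradient(fvp, g, damping=1e-3, max_iters=50, tol=1e-10):
    """Solve (F + damping*I) v = g using only Fisher-vector products.

    fvp: callable returning F @ v for a vector v (matrix-free access to the FIM).
    g:   flattened gradient vector.
    Returns an approximation to the natural gradient (F + damping*I)^{-1} g.
    """
    v = np.zeros_like(g)
    r = g.copy()                      # residual for the initial guess v = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Fp = fvp(p) + damping * p     # matrix-vector product; F is never materialized
        alpha = rs_old / (p @ Fp)
        v += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v

# Toy usage with a synthetic SPD "Fisher" exposed only through matvecs.
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 100))
F = M @ M.T / 100.0                   # stands in for an (unseen) FIM
g = rng.normal(size=100)
nat_grad = conjugate_gradient(lambda v: F @ v, g)
```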
Related papers
- Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC [26.275682325827706]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning Kronecker factors are dense.
We formulate an inverse-free KFAC update and impose structures on the Kronecker factors, resulting in structured inverse-free natural gradient descent.
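As a hedged illustration of why structure in the Kronecker factors saves memory and removes explicit inversion, the sketch below (an assumption made for exposition, not that paper's actual update) keeps only diagonal Kronecker factors, so the preconditioning step reduces to elementwise division.

```python
import numpy as np

def structured_kfac_precondition(grad_W, a_diag, g_diag, damping=1e-3):
    """Precondition a layer gradient with diagonal Kronecker factors.

    grad_W: (out_dim, in_dim) gradient of a linear layer.
    a_diag: (in_dim,)  diagonal of the input-covariance factor A.
    g_diag: (out_dim,) diagonal of the output-gradient factor G.
    With diagonal factors, (G + dI)^{-1} grad_W (A + dI)^{-1} is an
    elementwise division, so no dense matrix is stored or inverted.
    """
    return grad_W / np.outer(g_diag + damping, a_diag + damping)

# Hypothetical usage with running diagonal statistics.
out_dim, in_dim = 64, 128
grad_W = np.random.randn(out_dim, in_dim)
a_diag = np.random.rand(in_dim) + 0.1     # e.g. EMA of activation second moments
g_diag = np.random.rand(out_dim) + 0.1    # e.g. EMA of output-gradient second moments
update = structured_kfac_precondition(grad_W, a_diag, g_diag)
```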
arXiv Detail & Related papers (2023-12-09T23:13:32Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow for inexact computation of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates [0.0]
We exploit the exponential-average construction paradigm of the K-factors, and use online numerical linear algebra techniques.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure which scales linearly as well.
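To make the online-update idea concrete, here is a small sketch under my own assumptions (a generic rank-1 Sherman-Morrison refresh, not necessarily the procedure proposed in that paper): when a K-factor is maintained as an exponential average with one rank-1 term per step, its inverse can be updated in place instead of being recomputed from scratch.

```python
import numpy as np

def ema_kfactor_inverse_update(A_inv, a_bar, beta=0.95):
    """Rank-1 online update of a K-factor inverse.

    The factor evolves as A_new = beta * A + (1 - beta) * a_bar a_bar^T
    (an exponential average of activation statistics). Writing
    A_new = beta * (A + c * a_bar a_bar^T) with c = (1 - beta) / beta,
    Sherman-Morrison yields A_new^{-1} without a fresh inversion.
    """
    c = (1.0 - beta) / beta
    Ainv_a = A_inv @ a_bar
    denom = 1.0 + c * (a_bar @ Ainv_a)
    inner_inv = A_inv - c * np.outer(Ainv_a, Ainv_a) / denom   # (A + c a a^T)^{-1}
    return inner_inv / beta                                     # A_new^{-1}

# Check against a direct inverse on a small example.
d = 50
M = np.random.randn(d, d)
A = M @ M.T + np.eye(d)
a_bar = np.random.randn(d)
A_new = 0.95 * A + 0.05 * np.outer(a_bar, a_bar)
print(np.allclose(ema_kfactor_inverse_update(np.linalg.inv(A), a_bar),
                  np.linalg.inv(A_new)))
```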
arXiv Detail & Related papers (2022-10-16T09:41:23Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximate gradient descent method based on a recent practical envelope-smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked-range losses, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
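The matrix-free flavor of that approach can be illustrated with the classical stochastic diagonal estimator from numerical linear algebra (a generic sketch under my own assumptions, not the paper's exact scheme): the diagonal of a matrix is recovered from matrix-vector products with random probe vectors, so the matrix itself is never constructed.

```python
import numpy as np

def estimate_diagonal(matvec, dim, num_probes=200, rng=None):
    """Estimate diag(A) using only products v -> A @ v.

    Uses Rademacher probes z: E[z * (A @ z)] = diag(A) elementwise,
    so averaging over probes recovers the diagonal without forming A.
    """
    rng = np.random.default_rng(rng)
    acc = np.zeros(dim)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        acc += z * matvec(z)
    return acc / num_probes

# Toy check: the hidden matrix is only accessed through matvecs.
dim = 30
M = np.random.randn(dim, dim)
A = M @ M.T
approx = estimate_diagonal(lambda v: A @ v, dim)
print(np.max(np.abs(approx - np.diag(A))))   # shrinks as num_probes grows
```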
arXiv Detail & Related papers (2022-02-25T16:35:26Z)
- Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed.
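For context, the Kronecker-factored block-diagonal approximation that KFAC builds on can be written down in a few lines. The sketch below illustrates the standard identity (A kron G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}) for one fully connected layer; the factor matrices are synthetic placeholders rather than statistics from a real network, and the improvements proposed in that paper are not reproduced here.

```python
import numpy as np

def kfac_layer_natural_gradient(grad_W, A, G, damping=1e-3):
    """Apply the Kronecker-factored inverse-FIM block of one layer.

    A: (in_dim, in_dim)   covariance of layer inputs (activations).
    G: (out_dim, out_dim) covariance of output gradients.
    The FIM block is approximated as A kron G, and its inverse acts on the
    gradient as G^{-1} @ grad_W @ A^{-1}, so only the small factors are inverted.
    """
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    return np.linalg.solve(G_damped, grad_W) @ np.linalg.inv(A_damped)

# Synthetic factors standing in for accumulated K-FAC statistics.
in_dim, out_dim = 32, 16
X, Y = np.random.randn(in_dim, in_dim), np.random.randn(out_dim, out_dim)
A, G = X @ X.T / in_dim, Y @ Y.T / out_dim
grad_W = np.random.randn(out_dim, in_dim)
nat_grad_W = kfac_layer_natural_gradient(grad_W, A, G)
```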
arXiv Detail & Related papers (2022-01-25T12:56:17Z)
- Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures [49.73889315176884]
Conditional gradient methods (CGM) are widely used in modern machine learning.
Most efforts focus on reducing the number of iterations as a means to reduce the overall running time.
We give the first algorithms for many fundamental optimization methods whose cost per iteration is sublinear in the number of parameters.
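The connection to maximum inner product (MaxIP) data structures can be made concrete with a small sketch. The following is my own illustrative example, not that paper's algorithm: in a Frank-Wolfe (conditional gradient) step over the convex hull of a fixed set of atoms, the linear minimization oracle is exactly an inner-product maximization over the atoms, which is the query an approximate MaxIP structure could answer in sublinear time.

```python
import numpy as np

def frank_wolfe_step(x, gradient, atoms, step):
    """One conditional gradient step over conv(atoms).

    atoms: (num_atoms, dim) array of vertices of the feasible set.
    The linear minimization oracle argmin_s <gradient, s> equals
    argmax_s <-gradient, s>: a maximum inner product query, which is what a
    MaxIP data structure would accelerate instead of this exhaustive scan.
    """
    scores = atoms @ (-gradient)          # brute-force MaxIP query, O(num_atoms * dim)
    s = atoms[np.argmax(scores)]
    return x + step * (s - x)             # move toward the chosen vertex

# Toy usage: minimize 0.5*||x - target||^2 over the convex hull of random atoms.
dim, num_atoms = 20, 500
atoms = np.random.randn(num_atoms, dim)
target = atoms.mean(axis=0)               # lies inside the hull by construction
x = atoms[0].copy()
for t in range(1, 200):
    grad = x - target
    x = frank_wolfe_step(x, grad, atoms, step=2.0 / (t + 2))
```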
arXiv Detail & Related papers (2021-11-30T05:40:14Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z)
- Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent.
In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix.
One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
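To contrast the two families mentioned above, here is a hedged sketch (a generic textbook Adagrad update, not taken from that paper) of the diagonal style of empirical-Fisher preconditioning; K-FAC replaces this per-coordinate scaling with per-layer Kronecker-factored blocks like the one sketched earlier.

```python
import numpy as np

def adagrad_update(params, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad step: precondition by a diagonal built from squared gradients.

    accum keeps a running sum of squared gradients; dividing by its square root
    acts as a diagonal approximation of curvature / the empirical Fisher.
    """
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

# Toy usage on a quadratic objective f(x) = 0.5 * x^T D x.
dim = 10
D = np.linspace(1.0, 100.0, dim)       # ill-conditioned diagonal curvature
x = np.ones(dim)
accum = np.zeros(dim)
for _ in range(500):
    grad = D * x
    x, accum = adagrad_update(x, grad, accum)
```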
arXiv Detail & Related papers (2020-11-01T17:54:21Z)
- Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes [53.72166325215299]
We study minimax-optimal reinforcement learning in episodic factored Markov decision processes (FMDPs).
The first of our two algorithms achieves minimax optimal regret guarantees for a rich class of factored structures.
The second enjoys better computational complexity at the cost of slightly worse regret.
arXiv Detail & Related papers (2020-06-24T00:50:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.