Two-Level K-FAC Preconditioning for Deep Learning
- URL: http://arxiv.org/abs/2011.00573v3
- Date: Sun, 6 Dec 2020 21:35:46 GMT
- Title: Two-Level K-FAC Preconditioning for Deep Learning
- Authors: Nikolaos Tselepidis and Jonas Kohler and Antonio Orvieto
- Abstract summary: In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent.
In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix.
One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
- Score: 7.699428789159717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of deep learning, many optimization methods use gradient
covariance information in order to accelerate the convergence of Stochastic
Gradient Descent. In particular, starting with Adagrad, a seemingly endless
line of research advocates the use of diagonal approximations of the so-called
empirical Fisher matrix in stochastic gradient-based algorithms, with the most
prominent one arguably being Adam. However, in recent years, several works cast
doubt on the theoretical basis of preconditioning with the empirical Fisher
matrix, and it has been shown that more sophisticated approximations of the
actual Fisher matrix more closely resemble the theoretically well-motivated
Natural Gradient Descent. One particularly successful variant of such methods
is the so-called K-FAC optimizer, which uses a Kronecker-factored
block-diagonal Fisher approximation as preconditioner. In this work, drawing
inspiration from two-level domain decomposition methods used as preconditioners
in the field of scientific computing, we extend K-FAC by enriching it with
off-diagonal (i.e. global) curvature information in a computationally efficient
way. We achieve this by adding a coarse-space correction term to the
preconditioner, which captures the global Fisher information matrix at a
coarser scale. We present a small set of experimental results suggesting
improved convergence behaviour of our proposed method.
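The abstract pins down the two ingredients precisely enough for a rough illustration: a per-layer Kronecker-factored (block-diagonal) solve, plus an additive coarse-space correction that restricts the gradient to a small global subspace, in the spirit of two-level domain decomposition. The sketch below is a minimal reading of that idea under assumed interfaces (per-layer Kronecker factors A and G, a hypothetical coarse restriction matrix R and coarse Fisher matrix R F R^T); it is not the authors' implementation, and the exact coarse space and coupling used in the paper may differ.

```python
# Minimal sketch (not the authors' code) of a two-level preconditioner in the
# spirit of the paper: a K-FAC-style block-diagonal solve per layer, enriched
# with a coarse-space correction that injects global (off-diagonal) curvature.
# All names (coarse_basis, fisher_coarse, damping, ...) are illustrative assumptions.
import numpy as np

def kfac_block_solve(grad_W, A, G, damping=1e-3):
    """Approximate F_layer^{-1} grad via the Kronecker factors A and G:
    (A kron G)^{-1} vec(grad_W) == vec(G^{-1} grad_W A^{-1}), with damping."""
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv

def two_level_precondition(grads, factors, coarse_basis, fisher_coarse, damping=1e-3):
    """grads: list of per-layer gradient matrices.
    factors: list of (A, G) Kronecker factors, one pair per layer.
    coarse_basis: R with shape (k, n_params), e.g. one coarse direction per layer.
    fisher_coarse: k x k matrix R F R^T estimated from sampled gradients."""
    # Fine level: standard block-diagonal K-FAC solve, layer by layer.
    fine = [kfac_block_solve(g, A, G, damping) for g, (A, G) in zip(grads, factors)]

    # Coarse level: restrict the flattened gradient to the coarse space,
    # solve the small k x k system, and prolongate back. This term carries
    # the global curvature information that the block-diagonal part ignores.
    flat_grad = np.concatenate([g.ravel() for g in grads])
    rhs = coarse_basis @ flat_grad                          # R g
    coarse = np.linalg.solve(fisher_coarse + damping * np.eye(len(rhs)), rhs)
    coarse_corr = coarse_basis.T @ coarse                   # R^T (R F R^T)^{-1} R g

    # Additive two-level combination (one of several possible couplings).
    flat_fine = np.concatenate([f.ravel() for f in fine])
    return flat_fine + coarse_corr
```

The additive combination mirrors a two-level additive Schwarz preconditioner from domain decomposition; multiplicative or deflation-style couplings of the two levels are equally plausible design choices.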
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Natural Gradient Methods: Perspectives, Efficient-Scalable
Approximations, and Analysis [0.0]
Natural Gradient Descent is a second-order optimization method motivated by information geometry.
It makes use of the Fisher Information Matrix instead of the Hessian that is typically used.
Being a second-order method, it is infeasible to apply directly to problems with a huge number of parameters and data points (a minimal form of the corresponding update is sketched after this list).
arXiv Detail & Related papers (2023-03-06T04:03:56Z) - Efficient Approximations of the Fisher Matrix in Neural Networks using
Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed.
arXiv Detail & Related papers (2022-01-25T12:56:17Z) - Tensor Normal Training for Deep Learning Models [10.175972095073282]
We propose and analyze a brand new approximate natural gradient method, Tensor Normal Training (TNT).
In our experiments, TNT exhibited superior optimization performance to first-order methods.
arXiv Detail & Related papers (2021-06-05T15:57:22Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work concerns zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Eigenvalue-corrected Natural Gradient Based on a New Approximation [31.1453204659019]
Eigenvalue-corrected Kronecker Factorization (EKFAC) is a proposed method for training deep neural networks (DNNs).
In this work, we combine the ideas of these two methods and propose Trace-restricted Eigenvalue-corrected Kronecker Factorization (TEKFAC).
The proposed method not only corrects the inexact re-scaling factor under the Kronecker-factored eigenbasis, but also considers the new approximation technique proposed in Gao et al.
arXiv Detail & Related papers (2020-11-27T08:57:29Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
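For reference (and as pointed to from the Natural Gradient Methods entry above), the update that several of the listed papers approximate, together with the per-layer Kronecker factorization used by K-FAC, can be written as below. The notation ($a_{\ell-1}$ for layer inputs, $g_\ell$ for pre-activation gradients) follows the standard K-FAC convention rather than any specific paper above.

```latex
% Natural gradient step with the Fisher information matrix F,
% and the K-FAC per-layer Kronecker approximation of its diagonal blocks.
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log p_\theta(y \mid x)\,
            \nabla_\theta \log p_\theta(y \mid x)^{\top}\right],\\[2pt]
F_\ell &\approx A_{\ell-1} \otimes G_\ell,
\qquad
A_{\ell-1} = \mathbb{E}\!\left[a_{\ell-1} a_{\ell-1}^{\top}\right],
\quad
G_\ell = \mathbb{E}\!\left[g_\ell\, g_\ell^{\top}\right].
\end{aligned}
```

Here $a_{\ell-1}$ denotes the inputs to layer $\ell$ and $g_\ell$ the gradients with respect to its pre-activations; K-FAC inverts each $A_{\ell-1} \otimes G_\ell$ block separately, which is exactly the block-diagonal structure the two-level correction above is meant to enrich.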
This list is automatically generated from the titles and abstracts of the papers in this site.