Two-Level K-FAC Preconditioning for Deep Learning
- URL: http://arxiv.org/abs/2011.00573v3
- Date: Sun, 6 Dec 2020 21:35:46 GMT
- Title: Two-Level K-FAC Preconditioning for Deep Learning
- Authors: Nikolaos Tselepidis and Jonas Kohler and Antonio Orvieto
- Abstract summary: In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent.
In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix.
One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-factored block-diagonal preconditioner.
- Score: 7.699428789159717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of deep learning, many optimization methods use gradient
covariance information in order to accelerate the convergence of Stochastic
Gradient Descent. In particular, starting with Adagrad, a seemingly endless
line of research advocates the use of diagonal approximations of the so-called
empirical Fisher matrix in stochastic gradient-based algorithms, with the most
prominent one arguably being Adam. However, in recent years, several works cast
doubt on the theoretical basis of preconditioning with the empirical Fisher
matrix, and it has been shown that more sophisticated approximations of the
actual Fisher matrix more closely resemble the theoretically well-motivated
Natural Gradient Descent. One particularly successful variant of such methods
is the so-called K-FAC optimizer, which uses a Kronecker-factored
block-diagonal Fisher approximation as preconditioner. In this work, drawing
inspiration from two-level domain decomposition methods used as preconditioners
in the field of scientific computing, we extend K-FAC by enriching it with
off-diagonal (i.e. global) curvature information in a computationally efficient
way. We achieve this by adding a coarse-space correction term to the
preconditioner, which captures the global Fisher information matrix at a
coarser scale. We present a small set of experimental results suggesting
improved convergence behaviour of our proposed method.
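The abstract pins down the two ingredients precisely enough for a rough illustration: a per-layer Kronecker-factored (block-diagonal) solve, plus an additive coarse-space correction that restricts the gradient to a small global subspace, in the spirit of two-level domain decomposition. The sketch below is a minimal reading of that idea under assumed interfaces (per-layer Kronecker factors A and G, a hypothetical coarse restriction matrix R and coarse Fisher matrix R F R^T); it is not the authors' implementation, and the exact coarse space and coupling used in the paper may differ.

```python
# Minimal sketch (not the authors' code) of a two-level preconditioner in the
# spirit of the paper: a K-FAC-style block-diagonal solve per layer, enriched
# with a coarse-space correction that injects global (off-diagonal) curvature.
# All names (coarse_basis, fisher_coarse, damping, ...) are illustrative assumptions.
import numpy as np

def kfac_block_solve(grad_W, A, G, damping=1e-3):
    """Approximate F_layer^{-1} grad via the Kronecker factors A and G:
    (A kron G)^{-1} vec(grad_W) == vec(G^{-1} grad_W A^{-1}), with damping."""
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ grad_W @ A_inv

def two_level_precondition(grads, factors, coarse_basis, fisher_coarse, damping=1e-3):
    """grads: list of per-layer gradient matrices.
    factors: list of (A, G) Kronecker factors, one pair per layer.
    coarse_basis: R with shape (k, n_params), e.g. one coarse direction per layer.
    fisher_coarse: k x k matrix R F R^T estimated from sampled gradients."""
    # Fine level: standard block-diagonal K-FAC solve, layer by layer.
    fine = [kfac_block_solve(g, A, G, damping) for g, (A, G) in zip(grads, factors)]

    # Coarse level: restrict the flattened gradient to the coarse space,
    # solve the small k x k system, and prolongate back. This term carries
    # the global curvature information that the block-diagonal part ignores.
    flat_grad = np.concatenate([g.ravel() for g in grads])
    rhs = coarse_basis @ flat_grad                          # R g
    coarse = np.linalg.solve(fisher_coarse + damping * np.eye(len(rhs)), rhs)
    coarse_corr = coarse_basis.T @ coarse                   # R^T (R F R^T)^{-1} R g

    # Additive two-level combination (one of several possible couplings).
    flat_fine = np.concatenate([f.ravel() for f in fine])
    return flat_fine + coarse_corr
```

The additive combination mirrors a two-level additive Schwarz preconditioner from domain decomposition; multiplicative or deflation-style couplings of the two levels are equally plausible design choices.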
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Natural Gradient Methods: Perspectives, Efficient-Scalable
Approximations, and Analysis [0.0]
Natural Gradient Descent is a second-order optimization method motivated by information geometry.
It makes use of the Fisher Information Matrix instead of the Hessian that is typically used.
Being a second-order method, it is infeasible to apply directly to problems with a huge number of parameters and data points (a minimal form of the corresponding update is sketched after this list).
arXiv Detail & Related papers (2023-03-06T04:03:56Z) - Efficient Approximations of the Fisher Matrix in Neural Networks using
Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed.
arXiv Detail & Related papers (2022-01-25T12:56:17Z) - Tensor Normal Training for Deep Learning Models [10.175972095073282]
We propose and analyze a brand new approximate natural gradient method, Tensor Normal Training (TNT).
In our experiments, TNT exhibited superior optimization performance to first-order methods.
arXiv Detail & Related papers (2021-06-05T15:57:22Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work concerns zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Eigenvalue-corrected Natural Gradient Based on a New Approximation [31.1453204659019]
Eigenvalue-corrected Kronecker Factorization (EKFAC) is a proposed method for training deep neural networks (DNNs).
In this work, we combine the ideas of these two methods and propose Trace-restricted Eigenvalue-corrected Kronecker Factorization (TEKFAC).
The proposed method not only corrects the inexact re-scaling factor under the Kronecker-factored eigenbasis, but also considers the new approximation technique proposed in Gao et al.
arXiv Detail & Related papers (2020-11-27T08:57:29Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
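For reference (and as pointed to from the Natural Gradient Methods entry above), the update that several of the listed papers approximate, together with the per-layer Kronecker factorization used by K-FAC, can be written as below. The notation ($a_{\ell-1}$ for layer inputs, $g_\ell$ for pre-activation gradients) follows the standard K-FAC convention rather than any specific paper above.

```latex
% Natural gradient step with the Fisher information matrix F,
% and the K-FAC per-layer Kronecker approximation of its diagonal blocks.
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log p_\theta(y \mid x)\,
            \nabla_\theta \log p_\theta(y \mid x)^{\top}\right],\\[2pt]
F_\ell &\approx A_{\ell-1} \otimes G_\ell,
\qquad
A_{\ell-1} = \mathbb{E}\!\left[a_{\ell-1} a_{\ell-1}^{\top}\right],
\quad
G_\ell = \mathbb{E}\!\left[g_\ell\, g_\ell^{\top}\right].
\end{aligned}
```

Here $a_{\ell-1}$ denotes the inputs to layer $\ell$ and $g_\ell$ the gradients with respect to its pre-activations; K-FAC inverts each $A_{\ell-1} \otimes G_\ell$ block separately, which is exactly the block-diagonal structure the two-level correction above is meant to enrich.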
This list is automatically generated from the titles and abstracts of the papers in this site.