Kronecker-factored Quasi-Newton Methods for Convolutional Neural
Networks
- URL: http://arxiv.org/abs/2102.06737v1
- Date: Fri, 12 Feb 2021 19:40:34 GMT
- Title: Kronecker-factored Quasi-Newton Methods for Convolutional Neural
Networks
- Authors: Yi Ren, Donald Goldfarb
- Abstract summary: KF-QN-CNN is a new Kronecker-factored quasi-Newton method for training convolutional neural networks (CNNs).
KF-QN-CNN consistently exhibited superior performance in all of our tests.
- Score: 10.175972095073282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Second-order methods have the capability of accelerating optimization by
using much richer curvature information than first-order methods. However, most
are impractical in a deep learning setting where the number of training
parameters is huge. In this paper, we propose KF-QN-CNN, a new
Kronecker-factored quasi-Newton method for training convolutional neural
networks (CNNs), where the Hessian is approximated by a layer-wise block
diagonal matrix and each layer's diagonal block is further approximated by a
Kronecker product corresponding to the structure of the Hessian restricted to
that layer. New damping and Hessian-action techniques for BFGS are designed to
deal with the non-convexity and the particularly large size of Kronecker
matrices in CNN models, and convergence results are proved for a variant of
KF-QN-CNN under relatively mild conditions. KF-QN-CNN has memory requirements
comparable to first-order methods and much less per-iteration time complexity
than traditional second-order methods. Compared with state-of-the-art first-
and second-order methods on several CNN models, KF-QN-CNN consistently
exhibited superior performance in all of our tests.
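
To make the layer-wise Kronecker structure concrete, below is a minimal NumPy sketch of Kronecker-factored preconditioning for a single fully-connected layer: the curvature block is approximated as a Kronecker product of a small input-side factor A and an output-side factor G, so its inverse can be applied with two small solves instead of forming the full block. The factor construction from minibatch second moments, the damping constant lam, and the function name are illustrative assumptions; this is not the authors' KF-QN-CNN implementation, which builds and damps its factors with BFGS-type techniques.

```python
import numpy as np

def kron_factored_precondition(grad_W, A, G, lam=1e-3):
    """Apply the inverse of a damped Kronecker-factored curvature
    approximation to a layer gradient.

    Uses the identity (A (x) G) vec(X) = vec(G X A^T), so the inverse can be
    applied with two small solves instead of forming the (m*n x m*n) block.
    grad_W : (m, n) gradient of the layer weight matrix
    A      : (n, n) Kronecker factor built from layer inputs (symmetric PSD)
    G      : (m, m) Kronecker factor built from output gradients (symmetric PSD)
    lam    : damping added to each factor (illustrative choice)
    """
    m, n = grad_W.shape
    A_damped = A + lam * np.eye(n)
    G_damped = G + lam * np.eye(m)
    # Equivalent to multiplying vec(grad_W) by the inverse of the damped
    # Kronecker-factored approximation: G^{-1} * grad_W * A^{-1}.
    return np.linalg.solve(G_damped, np.linalg.solve(A_damped, grad_W.T).T)

# Toy usage: one preconditioned gradient step for a 64x128 weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
grad_W = rng.standard_normal((64, 128))
X = rng.standard_normal((32, 128))          # batch of layer inputs
dY = rng.standard_normal((32, 64))          # batch of output gradients
A = X.T @ X / len(X)                        # (128, 128) input second moment
G = dY.T @ dY / len(dY)                     # (64, 64) output-grad second moment
W -= 0.1 * kron_factored_precondition(grad_W, A, G)
```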
Related papers
- Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks [3.7308074617637588]
We propose Kronecker-factored approximate curvature (KFAC) for PINN losses that greatly reduces the computational cost and allows scaling to much larger networks.
We find that our KFAC-based gradients are competitive with expensive second-order methods on small problems, scale more favorably to higher-dimensional neural networks and PDEs, and consistently outperform first-order methods and LBFGS.
arXiv Detail & Related papers (2024-05-24T14:36:02Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are widely used in machine learning, but their training objectives are non-convex.
In this paper we examine convex reformulations of neural network training based on Lasso models.
We show that the stationary points of the non-convex training objective can be characterized via the global optima of a subsampled convex program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC)
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z) - Dual Convexified Convolutional Neural Networks [27.0231994885228]
We propose the framework of dual convexified convolutional neural networks (DCCNNs)
In this framework, we first introduce a primal learning problem motivated from convexified convolutional neural networks (CCNNs)
We then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates.
arXiv Detail & Related papers (2022-05-27T15:45:08Z) - A Mini-Block Natural Gradient Method for Deep Neural Networks [12.48022619079224]
We propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF)
Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer.
arXiv Detail & Related papers (2022-02-08T20:01:48Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC)
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Connecting Weighted Automata, Tensor Networks and Recurrent Neural
Networks through Spectral Learning [58.14930566993063]
We present connections between three models used in different research fields: weighted finite automata (WFA) from formal languages and linguistics, recurrent neural networks used in machine learning, and tensor networks.
We introduce the first provable learning algorithm for linear 2-RNNs defined over sequences of continuous input vectors.
arXiv Detail & Related papers (2020-10-19T15:28:00Z) - ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution [57.635467829558664]
We introduce a structural regularization across convolutional kernels in a CNN.
We show that CNNs maintain performance with a dramatic reduction in parameters and computations.
arXiv Detail & Related papers (2020-09-04T20:41:47Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - Practical Quasi-Newton Methods for Training Deep Neural Networks [12.48022619079224]
In training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^2$ elements.
We approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks.
Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded.
arXiv Detail & Related papers (2020-06-16T02:27:12Z)
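
The damping idea mentioned in the KF-QN-CNN abstract and in the last entry above (keeping quasi-Newton approximations well-behaved despite an indefinite Hessian) can be illustrated with the classic Powell-damped BFGS update. The sketch below is a generic textbook version in NumPy with an assumed threshold mu; it is not the specific damping scheme proposed in these papers.

```python
import numpy as np

def damped_bfgs_update(B, s, y, mu=0.2):
    """One Powell-damped BFGS update of a Hessian approximation B.

    Powell damping replaces y with a convex combination of y and B s
    whenever the curvature condition s^T y >= mu * s^T B s fails, so the
    updated B stays positive definite even on non-convex objectives.
    (Classic damping rule, shown only to illustrate the general idea.)
    """
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy < mu * sBs:                       # curvature condition violated
        theta = (1.0 - mu) * sBs / (sBs - sy)
        y = theta * y + (1.0 - theta) * Bs  # damped gradient difference
    # Standard BFGS update with the (possibly damped) pair (s, y).
    return B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / (s @ y)

# Toy usage: the pair below violates the curvature condition, so damping kicks in.
B = np.eye(2)
s = np.array([0.1, -0.2])                   # parameter step
y = np.array([0.05, 0.3])                   # gradient difference
B = damped_bfgs_update(B, s, y)
```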
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this automatically generated content and is not responsible for any consequences of its use.