Gradient Descent on Neurons and its Link to Approximate Second-Order
Optimization
- URL: http://arxiv.org/abs/2201.12250v1
- Date: Fri, 28 Jan 2022 17:06:26 GMT
- Title: Gradient Descent on Neurons and its Link to Approximate Second-Order
Optimization
- Authors: Frederik Benzing
- Abstract summary: We show that Kronecker-Factored, block-diagonal curvature estimates (KFAC) significantly outperform true second-order updates.
We also show that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights.
- Score: 0.913755431537592
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Second-order optimizers are thought to hold the potential to speed up neural
network training, but due to the enormous size of the curvature matrix, they
typically require approximations to be computationally tractable. The most
successful family of approximations are Kronecker-Factored, block-diagonal
curvature estimates (KFAC). Here, we combine tools from prior work to evaluate
exact second-order updates with careful ablations to establish a surprising
result: Due to its approximations, KFAC is not closely related to second-order
updates, and in particular, it significantly outperforms true second-order
updates. This challenges widely held beliefs and immediately raises the
question of why KFAC performs so well. We answer this question by showing that
KFAC approximates a first-order algorithm, which performs gradient descent on
neurons rather than weights. Finally, we show that this optimizer often
improves over KFAC in terms of computational cost and data-efficiency.
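To make the update discussed in the abstract concrete, here is a minimal numpy sketch of the standard per-layer KFAC step for a fully-connected layer computing h = W a. The function name, tensor shapes, and damping value are illustrative assumptions, not the paper's code; the comment on the last factor reflects a loose reading of the paper's "gradient descent on neurons" claim.

```python
# Minimal sketch (assumption: a fully-connected layer h = W @ a; this is the
# standard KFAC update, not code from the paper).
import numpy as np

def kfac_step(W, a, g, lr=0.1, damping=1e-3):
    """One damped KFAC step for a layer with weights W of shape (out, in).

    a: (batch, in)  layer inputs
    g: (batch, out) gradients of the loss w.r.t. the layer outputs
    """
    batch = a.shape[0]
    grad_W = g.T @ a / batch          # ordinary gradient, (out, in)
    A = a.T @ a / batch               # input second moment, (in, in)
    G = g.T @ g / batch               # output-gradient second moment, (out, out)
    # Block-diagonal, Kronecker-factored curvature: the layer's Fisher block is
    # approximated by A kron G, so the step needs only two small inverses.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # Loosely, dropping G_inv (i.e. using only grad_W @ A_inv) gives the kind of
    # first-order "gradient descent on neurons" update the paper relates KFAC to.
    return W - lr * G_inv @ grad_W @ A_inv
```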
Related papers
- Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks [3.7308074617637588]
We propose Kronecker-factored approximate curvature (KFAC) for PINN losses that greatly reduces the computational cost and allows scaling to much larger networks.
We find that our KFAC-based gradients are competitive with expensive second-order methods on small problems, scale more favorably to higher-dimensional neural networks and PDEs, and consistently outperform first-order methods and LBFGS.
arXiv Detail & Related papers (2024-05-24T14:36:02Z) - Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
FNGD resembles the averaging of per-sample gradients used in first-order methods, so its computational complexity is comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z) - Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC [26.275682325827706]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning Kronecker factors are dense.
We formulate an inverse-free KFAC update and impose structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent.
arXiv Detail & Related papers (2023-12-09T23:13:32Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates [0.0]
We exploit the exponential-average construction paradigm of the K-factors, and use online numerical linear algebra techniques.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure that scales linearly as well (an illustrative sketch of the online inverse-update idea appears after this list).
arXiv Detail & Related papers (2022-10-16T09:41:23Z) - TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent
Kernels [141.29156234353133]
State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions.
We show this disparity can largely be attributed to challenges presented by non-convexity.
We propose a Train-Convexify neural network (TCT) procedure to sidestep this issue.
arXiv Detail & Related papers (2022-07-13T16:58:22Z) - Scalable K-FAC Training for Deep Neural Networks with Distributed
Preconditioning [19.04755792575149]
We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
arXiv Detail & Related papers (2022-06-30T09:22:25Z) - An Accelerated Variance-Reduced Conditional Gradient Sliding Algorithm
for First-order and Zeroth-order Optimization [111.24899593052851]
The conditional gradient algorithm (also known as the Frank-Wolfe algorithm) has recently regained popularity in the machine learning community.
ARCS, the proposed accelerated variance-reduced conditional gradient sliding algorithm, is the first zeroth-order conditional gradient sliding method for solving convex problems in zeroth-order optimization.
In first-order optimization, the convergence results of ARCS substantially outperform those of previous algorithms in terms of the number of gradient oracle queries (a minimal Frank-Wolfe step is sketched after this list).
arXiv Detail & Related papers (2021-09-18T07:08:11Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds while retaining theoretical convergence guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
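Regarding the "Brand New K-FACs" entry above: a common way to combine the exponential-average construction of a K-factor with online linear algebra is a Sherman-Morrison rank-one correction, which updates the factor's inverse at cost quadratic in the layer width (linear in the layer's number of weights) instead of recomputing a full inverse. The sketch below illustrates that general idea under these assumptions; it is not necessarily the paper's exact scheme.

```python
import numpy as np

def ema_kfactor_inverse(A_inv, a, rho=0.95):
    """Track the inverse of A_t = rho * A_prev + (1 - rho) * a a^T.

    Uses the Sherman-Morrison identity, so the per-sample cost is O(n^2)
    for an n x n factor, with no O(n^3) refactorization.

    A_inv: (n, n) inverse of the previous factor A_prev
    a:     (n,)   new activation (or backpropagated-gradient) sample
    """
    c = (1.0 - rho) / rho            # A_t = rho * (A_prev + c * a a^T)
    Aa = A_inv @ a
    correction = c * np.outer(Aa, Aa) / (1.0 + c * (a @ Aa))
    return (A_inv - correction) / rho
```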
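For the conditional gradient entry above, the projection-free step the Frank-Wolfe family builds on is x_{t+1} = x_t + gamma_t * (s_t - x_t), where s_t minimizes the linearized objective over the constraint set. The sketch below instantiates this over the probability simplex with a simple quadratic objective; the constraint set, objective, and step-size schedule are illustrative assumptions, and this is the plain method, not the accelerated ARCS algorithm.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, n_steps=200):
    """Plain conditional gradient (Frank-Wolfe) over the probability simplex."""
    x = np.array(x0, dtype=float)
    for t in range(n_steps):
        g = grad_f(x)
        # Linear minimization oracle for the simplex: put all mass on the
        # coordinate with the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (t + 2.0)       # classic O(1/t) step-size schedule
        x = x + gamma * (s - x)
    return x

# Example: project y onto the simplex by minimizing ||x - y||^2.
y = np.array([0.2, 0.5, 0.3])
x_hat = frank_wolfe_simplex(lambda x: 2.0 * (x - y), np.ones(3) / 3)
```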
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.