Gradient Descent on Neurons and its Link to Approximate Second-Order
Optimization
- URL: http://arxiv.org/abs/2201.12250v1
- Date: Fri, 28 Jan 2022 17:06:26 GMT
- Title: Gradient Descent on Neurons and its Link to Approximate Second-Order
Optimization
- Authors: Frederik Benzing
- Abstract summary: We show that Kronecker-Factored, block-diagonal curvature estimates (KFAC) significantly outperform true second-order updates.
We also show that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights.
- Score: 0.913755431537592
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Second-order optimizers are thought to hold the potential to speed up neural
network training, but due to the enormous size of the curvature matrix, they
typically require approximations to be computationally tractable. The most
successful family of approximations are Kronecker-Factored, block-diagonal
curvature estimates (KFAC). Here, we combine tools from prior work to evaluate
exact second-order updates with careful ablations to establish a surprising
result: Due to its approximations, KFAC is not closely related to second-order
updates, and in particular, it significantly outperforms true second-order
updates. This challenges widely held beliefs and immediately raises the
question of why KFAC performs so well. We answer this question by showing that
KFAC approximates a first-order algorithm, which performs gradient descent on
neurons rather than weights. Finally, we show that this optimizer often
improves over KFAC in terms of computational cost and data-efficiency.
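To make the update discussed in the abstract concrete, here is a minimal numpy sketch of the standard per-layer KFAC step for a fully-connected layer computing h = W a. The function name, tensor shapes, and damping value are illustrative assumptions, not the paper's code; the comment on the last factor reflects a loose reading of the paper's "gradient descent on neurons" claim.

```python
# Minimal sketch (assumption: a fully-connected layer h = W @ a; this is the
# standard KFAC update, not code from the paper).
import numpy as np

def kfac_step(W, a, g, lr=0.1, damping=1e-3):
    """One damped KFAC step for a layer with weights W of shape (out, in).

    a: (batch, in)  layer inputs
    g: (batch, out) gradients of the loss w.r.t. the layer outputs
    """
    batch = a.shape[0]
    grad_W = g.T @ a / batch          # ordinary gradient, (out, in)
    A = a.T @ a / batch               # input second moment, (in, in)
    G = g.T @ g / batch               # output-gradient second moment, (out, out)
    # Block-diagonal, Kronecker-factored curvature: the layer's Fisher block is
    # approximated by A kron G, so the step needs only two small inverses.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # Loosely, dropping G_inv (i.e. using only grad_W @ A_inv) gives the kind of
    # first-order "gradient descent on neurons" update the paper relates KFAC to.
    return W - lr * G_inv @ grad_W @ A_inv
```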
Related papers
- Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks [3.7308074617637588]
We propose Kronecker-factored approximate curvature (KFAC) for PINN losses that greatly reduces the computational cost and allows scaling to much larger networks.
We find that our KFAC-based gradients are competitive with expensive second-order methods on small problems, scale more favorably to higher-dimensional neural networks and PDEs, and consistently outperform first-order methods and LBFGS.
arXiv Detail & Related papers (2024-05-24T14:36:02Z) - Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
FNGD resembles the averaging of per-sample gradients used in first-order methods, so its computational complexity is comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z) - Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC [26.275682325827706]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning Kronecker factors are dense.
We formulate an inverse-free KFAC update and impose structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent.
arXiv Detail & Related papers (2023-12-09T23:13:32Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates [0.0]
We exploit the exponential-average construction paradigm of the K-factors, and use online numerical linear algebra techniques.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure that scales linearly as well (an illustrative sketch of the online inverse-update idea appears after this list).
arXiv Detail & Related papers (2022-10-16T09:41:23Z) - TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent
Kernels [141.29156234353133]
State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions.
We show this disparity can largely be attributed to challenges presented by non-convexity.
We propose a Train-Convexify neural network (TCT) procedure to sidestep this issue.
arXiv Detail & Related papers (2022-07-13T16:58:22Z) - Scalable K-FAC Training for Deep Neural Networks with Distributed
Preconditioning [19.04755792575149]
We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
arXiv Detail & Related papers (2022-06-30T09:22:25Z) - An Accelerated Variance-Reduced Conditional Gradient Sliding Algorithm
for First-order and Zeroth-order Optimization [111.24899593052851]
The conditional gradient algorithm (also known as the Frank-Wolfe algorithm) has recently regained popularity in the machine learning community.
ARCS, the proposed accelerated variance-reduced conditional gradient sliding algorithm, is the first zeroth-order conditional gradient sliding method for solving convex problems in zeroth-order optimization.
In first-order optimization, the convergence results of ARCS substantially outperform those of previous algorithms in terms of the number of gradient oracle queries (a minimal Frank-Wolfe step is sketched after this list).
arXiv Detail & Related papers (2021-09-18T07:08:11Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - A Trace-restricted Kronecker-Factored Approximation to Natural Gradient [32.41025119083869]
We propose a new approximation to the Fisher information matrix called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC).
Experiments show that our method has better performance compared with several state-of-the-art algorithms on some deep network architectures.
arXiv Detail & Related papers (2020-11-21T07:47:14Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds while retaining theoretical convergence guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
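Regarding the "Brand New K-FACs" entry above: a common way to combine the exponential-average construction of a K-factor with online linear algebra is a Sherman-Morrison rank-one correction, which updates the factor's inverse at cost quadratic in the layer width (linear in the layer's number of weights) instead of recomputing a full inverse. The sketch below illustrates that general idea under these assumptions; it is not necessarily the paper's exact scheme.

```python
import numpy as np

def ema_kfactor_inverse(A_inv, a, rho=0.95):
    """Track the inverse of A_t = rho * A_prev + (1 - rho) * a a^T.

    Uses the Sherman-Morrison identity, so the per-sample cost is O(n^2)
    for an n x n factor, with no O(n^3) refactorization.

    A_inv: (n, n) inverse of the previous factor A_prev
    a:     (n,)   new activation (or backpropagated-gradient) sample
    """
    c = (1.0 - rho) / rho            # A_t = rho * (A_prev + c * a a^T)
    Aa = A_inv @ a
    correction = c * np.outer(Aa, Aa) / (1.0 + c * (a @ Aa))
    return (A_inv - correction) / rho
```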
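For the conditional gradient entry above, the projection-free step the Frank-Wolfe family builds on is x_{t+1} = x_t + gamma_t * (s_t - x_t), where s_t minimizes the linearized objective over the constraint set. The sketch below instantiates this over the probability simplex with a simple quadratic objective; the constraint set, objective, and step-size schedule are illustrative assumptions, and this is the plain method, not the accelerated ARCS algorithm.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, n_steps=200):
    """Plain conditional gradient (Frank-Wolfe) over the probability simplex."""
    x = np.array(x0, dtype=float)
    for t in range(n_steps):
        g = grad_f(x)
        # Linear minimization oracle for the simplex: put all mass on the
        # coordinate with the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (t + 2.0)       # classic O(1/t) step-size schedule
        x = x + gamma * (s - x)
    return x

# Example: project y onto the simplex by minimizing ||x - y||^2.
y = np.array([0.2, 0.5, 0.3])
x_hat = frank_wolfe_simplex(lambda x: 2.0 * (x - y), np.ones(3) / 3)
```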
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.