Understanding Approximate Fisher Information for Fast Convergence of
Natural Gradient Descent in Wide Neural Networks
- URL: http://arxiv.org/abs/2010.00879v3
- Date: Mon, 7 Dec 2020 06:36:00 GMT
- Title: Understanding Approximate Fisher Information for Fast Convergence of
Natural Gradient Descent in Wide Neural Networks
- Authors: Ryo Karakida and Kazuki Osawa
- Abstract summary: Natural Gradient Descent (NGD) helps to accelerate the convergence of gradient descent dynamics.
It requires approximations in large-scale deep neural networks because of its high computational cost.
Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice.
- Score: 13.572168969227011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Gradient Descent (NGD) helps to accelerate the convergence of
gradient descent dynamics, but it requires approximations in large-scale deep
neural networks because of its high computational cost. Empirical studies have
confirmed that some NGD methods with approximate Fisher information converge
sufficiently fast in practice. Nevertheless, it remains unclear from the
theoretical perspective why and under what conditions such heuristic
approximations work well. In this work, we reveal that, under specific
conditions, NGD with approximate Fisher information achieves the same fast
convergence to global minima as exact NGD. We consider deep neural networks in
the infinite-width limit, and analyze the asymptotic training dynamics of NGD
in function space via the neural tangent kernel. In the function space, the
training dynamics with the approximate Fisher information are identical to
those with the exact Fisher information, and they converge quickly. The fast
convergence holds for layer-wise approximations: for instance, in the block-diagonal
approximation, where each block corresponds to a layer, as well as in the block
tri-diagonal and K-FAC approximations. We also find that a unit-wise
approximation achieves the same fast convergence under some assumptions. All of
these different approximations have an isotropic gradient in the function
space, and this plays a fundamental role in achieving the same convergence
properties in training. Thus, the current study gives a novel and unified
theoretical foundation with which to understand NGD methods in deep learning.
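To make the function-space picture above concrete, the following is a minimal sketch (not the authors' code, and unrelated to their experiments) of layer-wise, block-diagonal NGD on a toy regression problem in JAX. The network widths, damping, and step size are illustrative assumptions. With enough width, each per-layer Fisher block yields a nearly isotropic update in function space, so the residual is driven toward zero in a few steps; replacing the per-layer blocks with the full (damped) matrix $J^\top J / n$ recovers exact NGD, which contracts the residual in essentially one step in this toy.

```python
# Minimal sketch of layer-wise (block-diagonal) NGD on a toy regression task.
# Not the paper's implementation; widths, damping, and step size are illustrative.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, widths=(1, 32, 32, 1)):
    """Fully connected net in NTK parameterization (modest width so the toy runs quickly)."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_in, d_out)))
    return params

def forward(params, x):
    h = x
    for W in params[:-1]:
        h = jnp.tanh(h @ W / jnp.sqrt(W.shape[0]))  # 1/sqrt(width) scaling (NTK parameterization)
    W = params[-1]
    return h @ W / jnp.sqrt(W.shape[0])

# Toy 1-D regression data (illustrative).
key = jax.random.PRNGKey(0)
x = jnp.linspace(-1.0, 1.0, 16).reshape(-1, 1)
y = jnp.sin(3.0 * x).ravel()

params = init_params(key)
theta, unravel = ravel_pytree(params)       # flat parameter vector + inverse map
layer_sizes = [W.size for W in params]

def f(t):
    """Network outputs on the training inputs as a function of the flat parameters."""
    return forward(unravel(t), x).ravel()

# Each of the L layer blocks pushes the residual roughly toward zero (isotropic
# function-space gradient), so heuristically we scale the step size by ~1/L.
eta = 1.0 / len(layer_sizes)
damping = 1e-3                              # illustrative damping
for step in range(10):
    residual = f(theta) - y                 # function-space error, shape (n,)
    J = jax.jacobian(f)(theta)              # Jacobian df/dtheta, shape (n, p)
    grad = J.T @ residual / residual.size   # Euclidean gradient of the MSE loss
    # Layer-wise block-diagonal Fisher: for MSE each block is J_l^T J_l / n.
    nat_grad, offset = [], 0
    for size in layer_sizes:
        J_l = J[:, offset:offset + size]
        F_l = J_l.T @ J_l / residual.size + damping * jnp.eye(size)
        nat_grad.append(jnp.linalg.solve(F_l, grad[offset:offset + size]))
        offset += size
    theta = theta - eta * jnp.concatenate(nat_grad)
    print(step, float(jnp.mean(residual ** 2)))
```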
Related papers
- Convergence Analysis of the Wasserstein Proximal Algorithm beyond Geodesic Convexity [13.468026138183623]
The proximal descent algorithm is a powerful tool for nonlinear and nonsmooth optimization in a general metric space.
We show a faster convergence rate than the Euclidean algorithm on mean-field neural networks.
arXiv Detail & Related papers (2025-01-25T00:00:50Z) - Convergence analysis of wide shallow neural operators within the framework of Neural Tangent Kernel [4.313136216120379]
We conduct a convergence analysis of gradient descent for wide shallow neural operators and physics-informed shallow neural operators within the framework of the Neural Tangent Kernel (NTK).
Under the setting of over-parametrization, gradient descent can find the global minimum regardless of whether it is in continuous time or discrete time.
arXiv Detail & Related papers (2024-12-07T05:47:28Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, of constant depth and increasing width.
An error bound in the continuous kernel norm implies an approximation result under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - A Bootstrap Algorithm for Fast Supervised Learning [0.0]
Training a neural network (NN) typically relies on some type of curve-following method, such as gradient descent (and stochastic gradient descent (SGD)), ADADELTA, ADAM or limited-memory algorithms.
Convergence for these algorithms usually relies on having access to a large quantity of observations in order to achieve a high level of accuracy and, with certain classes of functions, these algorithms could take multiple epochs of data points to catch on.
Herein, a different technique with the potential of achieving dramatically better speeds of convergence is explored: it does not curve-follow but rather relies on 'decoupling' hidden layers and on
arXiv Detail & Related papers (2023-05-04T18:28:18Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs can suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs for improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Deep Architecture Connectivity Matters for Its Convergence: A
Fine-Grained Analysis [94.64007376939735]
We theoretically characterize the impact of connectivity patterns on the convergence of deep neural networks (DNNs) under gradient descent training.
We show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate.
arXiv Detail & Related papers (2022-05-11T17:43:54Z) - A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that, with random initialization, the iterates converge to the global minimum at a rate governed by the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)