A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks
- URL: http://arxiv.org/abs/2204.08306v1
- Date: Mon, 18 Apr 2022 13:24:12 GMT
- Title: A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks
- Authors: Xin Liu, Wei Tao and Zhisong Pan
- Abstract summary: Momentum methods are widely used in training neural networks for their fast convergence.
We show that NAG converges to the global minimum at a $(1 - \mathcal{O}(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ depends on the condition number of the feature matrix.
We extend our analysis to deep linear ResNets and derive a similar result.
- Score: 21.994004684742812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated
gradient~(NAG), are widely used in training neural networks for their fast
convergence. However, there is a lack of theoretical guarantees for their
convergence and acceleration since the optimization landscape of the neural
network is non-convex. Recently, some works have made progress towards understanding
the convergence of momentum methods in the over-parameterized regime, where the
number of parameters exceeds the number of training instances. Nonetheless,
current results mainly focus on two-layer neural networks and fall far
short of explaining the remarkable success of momentum methods in training deep
neural networks. Motivated by this, we investigate the convergence of NAG with
constant learning rate and momentum parameter in training two architectures of
deep linear networks: deep fully-connected linear neural networks and deep
linear ResNets. In the over-parameterized regime, we first analyze the
residual dynamics induced by the training trajectory of NAG for a deep
fully-connected linear neural network under the random Gaussian initialization.
Our results show that NAG can converge to the global minimum at a $(1 -
\mathcal{O}(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and
$\kappa > 1$ is a constant depending on the condition number of the feature
matrix. Compared to the $(1 - \mathcal{O}(1/{\kappa}))^t$ rate of GD, NAG
achieves an acceleration over GD. To the best of our knowledge, this is the
first theoretical guarantee for the convergence of NAG to the global minimum in
training deep neural networks. Furthermore, we extend our analysis to deep
linear ResNets and derive a similar convergence result.
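As a companion to the abstract, here is a minimal sketch (not the authors' code) of NAG with a constant learning rate and constant momentum on a deep fully-connected linear network under random Gaussian initialization. The depth, widths, sample size, learning rate, and momentum below are illustrative choices, not values from the paper.

```python
# NAG with constant step size and momentum on f(X) = W_L ... W_1 X, squared loss.
import numpy as np

rng = np.random.default_rng(0)
d, m, n, L = 5, 64, 20, 3            # input dim, hidden width, #samples, depth (illustrative)
X = rng.standard_normal((d, n))      # feature matrix
Y = rng.standard_normal((1, n))      # targets

# Random Gaussian initialization, scaled by 1/sqrt(fan-in).
dims = [d] + [m] * (L - 1) + [1]
W = [rng.standard_normal((dims[l + 1], dims[l])) / np.sqrt(dims[l]) for l in range(L)]
V = [w.copy() for w in W]            # NAG's lookahead iterate y_t

eta, beta = 1e-2, 0.9                # constant learning rate and momentum (illustrative)

def loss(weights):
    P = X
    for w in weights:
        P = w @ P
    return 0.5 * np.linalg.norm(P - Y) ** 2

def grads(weights):
    # Forward pass caching activations, then backpropagate the squared-loss
    # residual through the chain of weight matrices.
    acts = [X]
    for w in weights:
        acts.append(w @ acts[-1])
    upstream = acts[-1] - Y
    G = [None] * L
    for l in reversed(range(L)):
        G[l] = upstream @ acts[l].T
        upstream = weights[l].T @ upstream
    return G

for t in range(500):
    g = grads(V)
    W_next = [v - eta * gl for v, gl in zip(V, g)]           # gradient step at the lookahead point
    V = [wn + beta * (wn - w) for wn, w in zip(W_next, W)]   # NAG extrapolation
    W = W_next
    if t % 100 == 0:
        print(f"iter {t:4d}  loss {loss(W):.4e}")

print(f"final loss: {loss(W):.4e}")
```

In this toy run the loss should shrink roughly geometrically, loosely mirroring the residual contraction analyzed in the paper, although the script does not satisfy the paper's over-parameterization conditions.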
Related papers
- Stochastic Gradient Descent for Two-layer Neural Networks [2.0349026069285423]
This paper presents a study on the convergence rates of the stochastic gradient descent (SGD) algorithm when applied to overparameterized two-layer neural networks.
Our approach combines the Neural Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Hilbert Space (RKHS) generated by the NTK.
Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the dynamics and convergence properties of neural networks.
arXiv Detail & Related papers (2024-07-10T13:58:57Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, under some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks [12.475834086073734]
First-order gradient methods have been extensively employed in training neural networks.
Recent research has proved that first-order methods are capable of converging to a global minimum when training over-parameterized neural networks.
arXiv Detail & Related papers (2022-08-08T07:13:26Z) - Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with
Linear Convergence Rates [7.094295642076582]
The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime.
We establish a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime.
arXiv Detail & Related papers (2022-05-19T21:05:40Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Provable Convergence of Nesterov Accelerated Method for
Over-Parameterized Neural Networks [7.40653399983911]
We analyze NAG for a two-layer fully connected neural network with ReLU activation.
We prove that the error of NAG converges to zero at a rate of $(1 - \Theta(1/\sqrt{\kappa}))^t$, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network (a numeric comparison of this rate with the GD rate appears after this list).
arXiv Detail & Related papers (2021-07-05T07:40:35Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
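As a small, hedged illustration of the $\sqrt{\kappa}$ acceleration claimed in the main abstract and in the NAG paper summarized above, the following sketch (not taken from either paper) ignores the hidden $\mathcal{O}(\cdot)$ and $\Theta(\cdot)$ constants and simply counts the iterations each contraction factor needs to reach a target accuracy:

```python
# Iterations t such that rate**t <= eps, for the GD-style rate (1 - 1/kappa)
# and the NAG-style rate (1 - 1/sqrt(kappa)); constants are ignored for illustration.
import math

eps = 1e-6
for kappa in (10, 100, 1000):
    t_gd = math.ceil(math.log(eps) / math.log(1 - 1 / kappa))
    t_nag = math.ceil(math.log(eps) / math.log(1 - 1 / math.sqrt(kappa)))
    print(f"kappa={kappa:5d}  GD ~{t_gd:7d} iters   NAG ~{t_nag:5d} iters")
```

The gap widens as $\kappa$ grows, which is the sense in which NAG's $(1 - \mathcal{O}(1/\sqrt{\kappa}))^t$ rate accelerates over GD's $(1 - \mathcal{O}(1/\kappa))^t$ rate.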