Provable Convergence of Nesterov Accelerated Method for
Over-Parameterized Neural Networks
- URL: http://arxiv.org/abs/2107.01832v1
- Date: Mon, 5 Jul 2021 07:40:35 GMT
- Title: Provable Convergence of Nesterov Accelerated Method for
Over-Parameterized Neural Networks
- Authors: Xin Liu and Zhisong Pan
- Abstract summary: We analyze NAG for two fully connected neural network with ReLU activation.
We prove that the error of NAG to zero at a rate of Theta (1/sqrtkappa)$, where $kappa 1$ is determined by a rate of neural network.
- Score: 7.40653399983911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the empirical success of deep learning, it still lacks theoretical
understandings to explain why randomly initialized neural network trained by
first-order optimization methods is able to achieve zero training loss, even
though its landscape is non-convex and non-smooth. Recently, there are some
works to demystifies this phenomenon under over-parameterized regime. In this
work, we make further progress on this area by considering a commonly used
momentum optimization algorithm: Nesterov accelerated method (NAG). We analyze
the convergence of NAG for two-layer fully connected neural network with ReLU
activation. Specifically, we prove that the error of NAG converges to zero at a
linear convergence rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa > 1$ is
determined by the initialization and the architecture of neural network.
Comparing to the rate $1-\Theta(1/\kappa)$ of gradient descent, NAG achieves an
acceleration. Besides, it also validates NAG and Heavy-ball method can achieve
a similar convergence rate.
Related papers
- Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks [3.680127959836384]
First-order methods, such as gradient descent (GD) and quadratic descent gradient (SGD) have been proven effective in training neural networks.
However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix.
In this paper, we show that for the $L2$ regression problems, the learning rate can be improved from $mathcalO(1)$ to $mathcalO(1)$, which implies that GD actually enjoys a faster convergence rate.
arXiv Detail & Related papers (2024-08-01T14:06:34Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and spectral decomposition of the labels -- learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks [12.475834086073734]
First-order gradient method has been extensively employed in training neural networks.
Recent research has proved that the first neural-order method is capable of attaining a global minimum convergence.
arXiv Detail & Related papers (2022-08-08T07:13:26Z) - A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training networks for their fast trajectory.
We show that the convergence of the random number and $kappaO can converge to the global minimum.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Regularization Matters: A Nonparametric Perspective on Overparametrized
Neural Network [20.132432350255087]
Overparametrized neural networks trained by tangent descent (GD) can provably overfit any training data.
This paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises.
arXiv Detail & Related papers (2020-07-06T01:02:23Z) - A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function was derived as indeed equivalent to a fully-trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.