Related papers: Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

URL: http://arxiv.org/abs/2208.03941v4
Date: Wed, 8 May 2024 15:34:45 GMT
Title: Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks
Authors: Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan,
Abstract summary: First-order gradient method has been extensively employed in training neural networks. Recent research has proved that the first neural-order method is capable of attaining a global minimum convergence.
Score: 12.475834086073734
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum during training over-parameterized neural networks, where the number of parameters is significantly larger than that of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov's accelerated gradient (NAG) method, are the workhorse of first-order gradient methods owning to their accelerated convergence. In practice, NAG often exhibits superior performance than HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of the two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our result not only establishes tighter upper bounds of the convergence rate for both HB and NAG, but also provides the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.

Related papers

Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes [0.15705429611931052]
We introduce the Gaussian Process Activation function (GAPA) to capture neuron-level uncertainties. Our approach operates in a post-hoc manner, preserving the original mean predictions of the pre-trained neural network.
arXiv Detail & Related papers (2025-02-28T11:29:06Z)
Stochastic Gradient Descent for Two-layer Neural Networks [2.0349026069285423]
This paper presents a study on the convergence rates of the descent (SGD) algorithm when applied to overparameterized two-layer neural networks. Our approach combines the Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Space (RKHS) generated by NTK. Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the dynamics and convergence properties of neural networks.
arXiv Detail & Related papers (2024-07-10T13:58:57Z)
Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network. We provide analytical expressions for these speed limits for linear and linearizable neural networks. Remarkably, given some plausible scaling assumptions on the NTK spectra and spectral decomposition of the labels -- learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z)
Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations. We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems. PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features. In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights. We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
A Convergence Analysis of Nesterov's Accelerated Gradient Method in Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training networks for their fast trajectory. We show that the convergence of the random number and $kappaO can converge to the global minimum. We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z)
Edge of chaos as a guiding principle for modern neural network training [19.419382003562976]
We study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram. In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset.
arXiv Detail & Related papers (2021-07-20T12:17:55Z)
Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
kernel methods outperform fully-connected finite-width networks. Centered and ensembled finite networks have reduced posterior variance. Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
Semi-Implicit Back Propagation [1.5533842336139065]
We propose a semi-implicit back propagation method for neural network training. The difference on the neurons are propagated in a backward fashion and the parameters are updated with proximal mapping. Experiments on both MNIST and CIFAR-10 demonstrate that the proposed algorithm leads to better performance in terms of both loss decreasing and training/validation accuracy.
arXiv Detail & Related papers (2020-02-10T03:26:09Z)
MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks. The use of gradient combined nonvolutionity renders learning susceptible to novel problems. We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.