Provable Accelerated Convergence of Nesterov's Momentum for Deep ReLU
Neural Networks
- URL: http://arxiv.org/abs/2306.08109v2
- Date: Fri, 5 Jan 2024 04:03:51 GMT
- Title: Provable Accelerated Convergence of Nesterov's Momentum for Deep ReLU
Neural Networks
- Authors: Fangshuo Liao, Anastasios Kyrillidis
- Abstract summary: Current state-of-the-art analyses on the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape.
We consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum achieves accelerated convergence in theory for this class.
We provide two realizations of the problem class, one of which is deep ReLU networks; to the best of our knowledge, this makes our work the first to prove an accelerated convergence rate for non-trivial neural network architectures.
- Score: 12.763567932588591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current state-of-the-art analyses on the convergence of gradient descent for
training neural networks focus on characterizing properties of the loss
landscape, such as the Polyak-Lojasiewicz (PL) condition and the restricted
strong convexity. While gradient descent converges linearly under such
conditions, it remains an open question whether Nesterov's momentum enjoys
accelerated convergence under similar settings and assumptions. In this work,
we consider a new class of objective functions, where only a subset of the
parameters satisfies strong convexity, and show Nesterov's momentum achieves
acceleration in theory for this objective class. We provide two realizations of
the problem class, one of which is deep ReLU networks, which, to the best of
our knowledge, makes this work the first to prove an accelerated convergence
rate for non-trivial neural network architectures.
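To make the claimed acceleration concrete, here is a minimal, illustrative sketch contrasting plain gradient descent with Nesterov's momentum on a strongly convex quadratic. This is not the paper's deep ReLU setting or its proof; the matrix spectrum, step size $1/L$, and momentum coefficient $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ are standard textbook choices assumed only for illustration.

```python
# Minimal illustrative sketch (not the paper's deep ReLU setting): compare
# plain gradient descent with Nesterov's momentum on a strongly convex
# quadratic f(x) = 0.5 * x^T A x, whose unique minimizer is x = 0.
# The spectrum of A, the step size 1/L, and the momentum coefficient are
# standard textbook choices assumed here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.linspace(1.0, 100.0, d)              # condition number kappa = 100
A = Q @ np.diag(eigvals) @ Q.T

L, mu = eigvals.max(), eigvals.min()              # smoothness / strong convexity
kappa = L / mu
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # momentum coefficient

def loss(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x0 = rng.standard_normal(d)
x_gd = x0.copy()                                  # gradient descent iterate
x_nag = x0.copy()                                 # Nesterov iterate
y = x0.copy()                                     # extrapolated point

for _ in range(200):
    # Gradient descent: error contracts roughly like (1 - 1/kappa) per step.
    x_gd = x_gd - (1.0 / L) * grad(x_gd)
    # Nesterov's momentum: error contracts roughly like (1 - 1/sqrt(kappa)).
    x_next = y - (1.0 / L) * grad(y)
    y = x_next + beta * (x_next - x_nag)
    x_nag = x_next

print(f"GD  loss after 200 steps: {loss(x_gd):.3e}")
print(f"NAG loss after 200 steps: {loss(x_nag):.3e}")
```

With $\kappa = 100$, gradient descent contracts the error roughly like $(1 - 1/\kappa)$ per step while Nesterov's momentum contracts roughly like $(1 - 1/\sqrt{\kappa})$, so after a few hundred iterations the momentum iterate is orders of magnitude closer to the optimum; that gap is what "accelerated convergence" refers to in the abstract.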
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Convergence Analysis for Learning Orthonormal Deep Linear Neural
Networks [27.29463801531576]
We provide convergence analysis for training orthonormal deep linear neural networks.
Our results shed light on how increasing the number of hidden layers can impact the convergence speed.
arXiv Detail & Related papers (2023-11-24T18:46:54Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Linear convergence of forward-backward accelerated algorithms without knowledge of the modulus of strong convexity [14.0409219811182]
We show that both Nesterov's accelerated gradient descent (NAG) and FISTA exhibit linear convergence for strongly convex functions.
We emphasize the distinctive approach employed in crafting the Lyapunov function, which involves a dynamically adapting coefficient of kinetic energy.
arXiv Detail & Related papers (2023-06-16T08:58:40Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that Nesterov's accelerated gradient with random initialization converges to the global minimum at an accelerated rate that depends on the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
arXiv Detail & Related papers (2022-04-18T13:24:12Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic
Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks; a minimal sketch of the clipped update step appears after this list.
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
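As referenced in the clipping entry above, the following is a minimal sketch of the clipped gradient step that such analyses study. It is not code from the cited paper; the step size eta, threshold gamma, and the toy objective are illustrative assumptions.

```python
# Minimal sketch of a clipped gradient descent step, as referenced in the
# clipping entry above. The step size eta and clipping threshold gamma are
# illustrative assumptions, not values taken from the cited paper.
import numpy as np

def clipped_gd_step(x, grad_fn, eta=0.1, gamma=1.0):
    """Take one clipped GD step: rescale the gradient if its norm exceeds gamma."""
    g = grad_fn(x)
    g_norm = np.linalg.norm(g)
    scale = min(1.0, gamma / (g_norm + 1e-12))    # leaves small gradients untouched
    return x - eta * scale * g

# Usage on a toy objective ||v||^4, whose gradient norm grows quickly away
# from the origin and would destabilize an unclipped step of the same size.
x = np.array([3.0, -2.0])
for _ in range(100):
    x = clipped_gd_step(x, lambda v: 4.0 * v * (v @ v))   # gradient of ||v||^4
print(x)  # approaches the minimizer at the origin
```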