A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions
- URL: http://arxiv.org/abs/2108.04620v1
- Date: Tue, 10 Aug 2021 12:01:37 GMT
- Title: A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions
- Authors: Arnulf Jentzen, Adrian Riekert
- Abstract summary: Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient descent (GD) type optimization methods are the standard instrument
to train artificial neural networks (ANNs) with rectified linear unit (ReLU)
activation. Despite the great success of GD type optimization methods in
numerical simulations for the training of ANNs with ReLU activation, it remains
- even in the simplest situation of the plain vanilla GD optimization method
with random initializations and ANNs with one hidden layer - an open problem to
prove (or disprove) the conjecture that the risk of the GD optimization method
converges in the training of such ANNs to zero as the width of the ANNs, the
number of independent random initializations, and the number of GD steps
increase to infinity. In this article we prove this conjecture in the situation
where the probability distribution of the input data is equivalent to the
continuous uniform distribution on a compact interval, where the probability
distributions for the random initializations of the ANN parameters are standard
normal distributions, and where the target function under consideration is
continuous and piecewise affine linear. Roughly speaking, the key ingredients
in our mathematical convergence analysis are (i) to prove that suitable sets of
global minima of the risk functions are \emph{twice continuously differentiable
submanifolds of the ANN parameter spaces}, (ii) to prove that the Hessians of
the risk functions on these sets of global minima satisfy an appropriate
\emph{maximal rank condition}, and, thereafter, (iii) to apply the machinery in
[Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic
gradient descent method for non-convex objective functions. J. Mach. Learn.
Res. 21(136): 1--48, 2020] to establish convergence of the GD optimization
method with random initializations.
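The setting the theorem addresses can be illustrated numerically. The sketch below (not from the paper; hyperparameters such as the width, learning rate, step count, and the target $f(x)=|x-1/2|$ are illustrative assumptions) trains a one-hidden-layer ReLU network by plain gradient descent from several independent standard normal initializations on a continuous piecewise affine target over $[0,1]$, and reports the best risk attained:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def risk(params, x, y):
    """Empirical L2 risk of a one-hidden-layer ReLU network."""
    w1, b1, w2, b2 = params
    pred = relu(np.outer(x, w1) + b1) @ w2 + b2
    return np.mean((pred - y) ** 2)

def grad(params, x, y):
    """Gradient of the empirical risk (ReLU subgradient at 0 taken as 0)."""
    w1, b1, w2, b2 = params
    pre = np.outer(x, w1) + b1            # (n, width) pre-activations
    hid = relu(pre)                       # hidden-layer outputs
    err = 2.0 * (hid @ w2 + b2 - y) / len(x)
    act = (pre > 0).astype(float)         # ReLU derivative
    back = np.outer(err, w2) * act        # gradient at the pre-activations
    return x @ back, back.sum(axis=0), hid.T @ err, err.sum()

# Continuous, piecewise affine linear target on [0, 1] (illustrative choice).
target = lambda t: np.abs(t - 0.5)
x = np.linspace(0.0, 1.0, 200)            # grid proxy for the uniform input law
y = target(x)

rng = np.random.default_rng(0)
width, steps, lr, n_inits = 32, 5000, 0.01, 5

best = np.inf
for _ in range(n_inits):                  # independent random initializations
    # Standard normal initialization, as in the paper's setting;
    # the 1/width scaling of the outer layer is an assumption for stability.
    params = [rng.standard_normal(width), rng.standard_normal(width),
              rng.standard_normal(width) / width, rng.standard_normal()]
    for _ in range(steps):                # plain vanilla GD steps
        params = [p - lr * g for p, g in zip(params, grad(params, x, y))]
    best = min(best, risk(params, x, y))

print(best)
```

Consistent with the conjecture proved here, the best-of-several-initializations risk shrinks as the width, the number of initializations, and the number of GD steps grow; the constant-predictor baseline for this target has risk $1/48 \approx 0.021$, which the trained network beats by a wide margin.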
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Constrained Optimization via Exact Augmented Lagrangian and Randomized
Iterative Sketching [55.28394191394675]
We develop an adaptive inexact Newton method for equality-constrained nonlinear, nonconvex optimization problems.
We demonstrate the superior performance of our method on benchmark nonlinear problems, constrained logistic regression with data from LIBSVM, and a PDE-constrained problem.
arXiv Detail & Related papers (2023-05-28T06:33:37Z) - Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - On the existence of global minima and convergence analyses for gradient
descent methods in the training of deep neural networks [3.198144010381572]
We study feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers.
We prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs.
We also study solutions of gradient flow differential equations.
arXiv Detail & Related papers (2021-12-17T18:55:40Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - Existence, uniqueness, and convergence rates for gradient flows in the
training of artificial neural networks with ReLU activation [2.4087148947930634]
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure.
Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type schemes in the training of ANNs with ReLU activation.
arXiv Detail & Related papers (2021-08-18T12:06:19Z) - Convergence analysis for gradient flows in the training of artificial
neural networks with ReLU activation [3.198144010381572]
Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
Most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in the training of ANNs with ReLU activation seem to be already present in the dynamics of the corresponding GF differential equations.
arXiv Detail & Related papers (2021-07-09T15:08:30Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions [3.4792548480344254]
We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
arXiv Detail & Related papers (2021-02-19T13:33:03Z)