On the existence of global minima and convergence analyses for gradient
descent methods in the training of deep neural networks
- URL: http://arxiv.org/abs/2112.09684v1
- Date: Fri, 17 Dec 2021 18:55:40 GMT
- Title: On the existence of global minima and convergence analyses for gradient
descent methods in the training of deep neural networks
- Authors: Arnulf Jentzen, Adrian Riekert
- Abstract summary: We study feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers.
We prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs.
We also study solutions of gradient flow differential equations.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article we study fully-connected feedforward deep ReLU ANNs with an
arbitrarily large number of hidden layers and we prove convergence of the risk
of the GD optimization method with random initializations in the training of
such ANNs under the assumption that the unnormalized probability density
function of the probability distribution of the input data of the considered
supervised learning problem is piecewise polynomial, under the assumption that
the target function (describing the relationship between input data and the
output data) is piecewise polynomial, and under the assumption that the risk
function of the considered supervised learning problem admits at least one
regular global minimum. In addition, in the special situation of shallow ANNs
with just one hidden layer and one-dimensional input we also verify this
assumption by proving in the training of such shallow ANNs that for every
Lipschitz continuous target function there exists a global minimum in the risk
landscape. Finally, in the training of deep ANNs with ReLU activation we also
study solutions of gradient flow (GF) differential equations and we prove that
every non-divergent GF trajectory converges with a polynomial rate of
convergence to a critical point (in the sense of limiting Fréchet
subdifferentiability). Our mathematical convergence analysis builds on tools
from real algebraic geometry such as the concept of semi-algebraic functions
and generalized Kurdyka-Łojasiewicz inequalities, on tools from functional
analysis such as the Arzelà-Ascoli theorem, on tools from nonsmooth analysis
such as the concept of limiting Fréchet subgradients, as well as on the fact
that the set of realization functions of shallow ReLU ANNs with fixed
architecture forms a closed subset of the set of continuous functions, a result
established by Petersen et al.
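To make the training setup concrete, the following is a minimal NumPy sketch (not the authors' code) of plain gradient descent with several random initializations for a shallow ReLU network with one-dimensional input and a Lipschitz continuous target, mirroring the setting described in the abstract. The width, step size, iteration count, number of restarts, and the target function are illustrative assumptions, not choices taken from the paper.

```python
# Minimal illustrative sketch: full-batch gradient descent with random
# initializations for a one-hidden-layer ReLU network (1D input, L2 risk).
# All hyperparameters below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def risk(params, x, y):
    """Empirical L2 risk of a one-hidden-layer ReLU network."""
    w, b, v, c = params
    pred = relu(np.outer(x, w) + b) @ v + c        # shape (n,)
    return np.mean((pred - y) ** 2)

def grad(params, x, y):
    """Gradient of the empirical risk (subgradient choice 0 at the ReLU kink)."""
    w, b, v, c = params
    pre = np.outer(x, w) + b                       # (n, width)
    act = relu(pre)
    pred = act @ v + c
    r = 2.0 * (pred - y) / len(x)                  # dL/dpred, shape (n,)
    dv = act.T @ r
    dc = r.sum()
    dact = np.outer(r, v) * (pre > 0)              # back through the ReLU
    dw = dact.T @ x
    db = dact.sum(axis=0)
    return dw, db, dv, dc

# Lipschitz continuous target on [0, 1] (illustrative choice).
x = rng.uniform(0.0, 1.0, size=256)
y = np.abs(x - 0.5)

width, lr, steps, restarts = 32, 0.01, 3000, 5
best = np.inf
for _ in range(restarts):                          # random initializations
    params = [rng.normal(0, 1, width), rng.normal(0, 1, width),
              rng.normal(0, 1 / np.sqrt(width), width), 0.0]
    for _ in range(steps):                         # plain full-batch GD
        g = grad(params, x, y)
        params = [p - lr * gi for p, gi in zip(params, g)]
    best = min(best, risk(params, x, y))
print(f"best risk over {restarts} random initializations: {best:.5f}")
```

With a sufficiently small step size the GD iterates trace an approximation of a gradient flow trajectory, the continuous-time object whose convergence to critical points is analyzed in the abstract.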
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation [2.4087148947930634]
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure.
To this day, the scientific literature in general contains no mathematical convergence analysis which explains the numerical success of GD type schemes in the training of ANNs with ReLU activation.
arXiv Detail & Related papers (2021-08-18T12:06:19Z)
- A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions [3.198144010381572]
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-08-10T12:01:37Z)
- Convergence analysis for gradient flows in the training of artificial neural networks with ReLU activation [3.198144010381572]
Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
Most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in the training of ANNs with ReLU activation seem to be already present in the dynamics of the corresponding GF differential equations.
arXiv Detail & Related papers (2021-07-09T15:08:30Z)
- Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the limit of infinite network width to show global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.