Convergence analysis for gradient flows in the training of artificial
neural networks with ReLU activation
- URL: http://arxiv.org/abs/2107.04479v1
- Date: Fri, 9 Jul 2021 15:08:30 GMT
- Title: Convergence analysis for gradient flows in the training of artificial
neural networks with ReLU activation
- Authors: Arnulf Jentzen and Adrian Riekert
- Abstract summary: Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
Most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in the training of ANNs with ReLU activation seem to be already present in the dynamics of the corresponding gradient flow (GF) differential equations.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient descent (GD) type optimization schemes are the standard methods to
train artificial neural networks (ANNs) with rectified linear unit (ReLU)
activation. Such schemes can be considered as discretizations of gradient flows
(GFs) associated to the training of ANNs with ReLU activation and most of the
key difficulties in the mathematical convergence analysis of GD type
optimization schemes in the training of ANNs with ReLU activation seem to be
already present in the dynamics of the corresponding GF differential equations.
It is the key subject of this work to analyze such GF differential equations in
the training of ANNs with ReLU activation and three layers (one input layer,
one hidden layer, and one output layer). In particular, in this article we
prove in the case where the target function is possibly multi-dimensional and
continuous and in the case where the probability distribution of the input data
is absolutely continuous with respect to the Lebesgue measure that the risk of
every bounded GF trajectory converges to the risk of a critical point. In
addition, in this article we show in the case of a 1-dimensional affine linear
target function and in the case where the probability distribution of the input
data coincides with the standard uniform distribution that the risk of every
bounded GF trajectory converges to zero if the initial risk is sufficiently
small. Finally, in the special situation where there is only one neuron on the
hidden layer (1-dimensional hidden layer) we strengthen the above named result
for affine linear target functions by proving that the risk of every (not
necessarily bounded) GF trajectory converges to zero if the initial risk is
sufficiently small.
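
To make the setting concrete, the following is a minimal numerical sketch (not taken from the paper): a one-hidden-layer ReLU network is fitted to an affine linear target function with input data uniform on [0, 1], and the gradient flow is integrated by explicit Euler steps, i.e. plain gradient descent with a small step size, which is exactly the discretization viewpoint of the abstract. The width, target function, step size, and sample size are illustrative assumptions, not values from the paper.

```python
# Minimal illustration (not the paper's proof machinery): explicit-Euler
# discretization of the gradient flow for a one-hidden-layer ReLU network
# fitting the affine target f(x) = 2x + 1 with inputs uniform on [0, 1].
# All concrete choices (width, target, step size, sample size) are assumptions.
import numpy as np

rng = np.random.default_rng(0)

width = 16                                   # number of hidden neurons
x = rng.uniform(0.0, 1.0, size=(4096, 1))    # Monte Carlo sample approximating U([0, 1])
y = 2.0 * x[:, 0] + 1.0                      # affine linear target function

# Parameters of the network  N(x) = v^T relu(w x + b) + c
w = rng.normal(scale=0.5, size=(width, 1))
b = rng.normal(scale=0.5, size=(width,))
v = rng.normal(scale=0.5, size=(width,))
c = 0.0

def forward(x):
    h = np.maximum(x @ w.T + b, 0.0)         # hidden activations, shape (n, width)
    return h @ v + c, h

dt = 1e-2                                    # Euler step size; small-step GD discretizes the flow
for step in range(20_000):
    pred, h = forward(x)
    err = pred - y                           # residuals, shape (n,)
    risk = np.mean(err ** 2)                 # empirical L^2 risk
    # (sub)gradients of the risk; relu'(z) is taken as 1{z > 0}
    grad_out = 2.0 * err / err.shape[0]
    gv = h.T @ grad_out
    gc = grad_out.sum()
    gh = np.outer(grad_out, v) * (h > 0.0)   # back-propagation through the ReLU layer
    gw = (gh * x).sum(axis=0, keepdims=True).T
    gb = gh.sum(axis=0)
    # explicit Euler step of the gradient flow  dTheta/dt = -grad Risk(Theta)
    w -= dt * gw; b -= dt * gb; v -= dt * gv; c -= dt * gc
    if step % 5_000 == 0:
        print(f"t = {step * dt:7.1f}   risk = {risk:.6f}")
```

In this kind of small-step discretization, the decay of the printed risk along the trajectory mirrors, at an informal level, the continuous-time convergence behaviour that the paper establishes rigorously.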
Related papers
- A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer
Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic
Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - On the existence of global minima and convergence analyses for gradient
descent methods in the training of deep neural networks [3.198144010381572]
We study feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers.
We prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs.
We also study solutions of gradient flow differential equations.
arXiv Detail & Related papers (2021-12-17T18:55:40Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Existence, uniqueness, and convergence rates for gradient flows in the
training of artificial neural networks with ReLU activation [2.4087148947930634]
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure.
Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type schemes in the training of ANNs with ReLU activation.
arXiv Detail & Related papers (2021-08-18T12:06:19Z) - A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions [3.198144010381572]
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-08-10T12:01:37Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.