Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions
- URL: http://arxiv.org/abs/2112.07369v2
- Date: Thu, 22 Jun 2023 18:05:01 GMT
- Title: Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions
- Authors: Martin Hutzenthaler, Arnulf Jentzen, Katharina Pohl, Adrian Riekert,
Luca Scarpa
- Abstract summary: Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
- Score: 1.7149364927872015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many numerical simulations stochastic gradient descent (SGD) type
optimization methods perform very effectively in the training of deep neural
networks (DNNs), but to this day it remains an open research problem to
provide a mathematical convergence analysis which rigorously explains the
success of SGD type optimization methods in the training of DNNs. In this work
we study SGD type optimization methods in the training of fully-connected
feedforward DNNs with rectified linear unit (ReLU) activation. We first
establish general regularity properties for the risk functions and their
generalized gradient functions appearing in the training of such DNNs and,
thereafter, we investigate the plain vanilla SGD optimization method in the
training of such DNNs under the assumption that the target function under
consideration is a constant function. Specifically, under the assumptions that
the learning rates (the step sizes of the SGD optimization method) are
sufficiently small but not $L^1$-summable and that the target function is a
constant function, we prove that the expectation of the risk of the considered
SGD process converges to zero as the number of SGD steps increases to infinity.
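The setting in the abstract can be illustrated with a toy experiment (a minimal sketch, not the paper's construction; the network sizes, batch size, and learning-rate schedule below are illustrative choices): plain vanilla SGD trains a small fully-connected ReLU network toward a constant target, with learning rates that decay but are not $L^1$-summable, e.g. $\gamma_n = c\,(n+1)^{-0.6}$, whose series diverges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fully-connected ReLU network: 2 -> 8 -> 8 -> 1.
dims = [2, 8, 8, 1]
params = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
          for m, n in zip(dims[:-1], dims[1:])]
TARGET = 1.0  # constant target function

def risk(params, x):
    """Mean squared risk of the network on a batch x."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers
    return float(np.mean((h - TARGET) ** 2))

def sgd_step(params, x, lr):
    """One plain vanilla SGD step via backpropagation."""
    acts, pre, h = [x], [], x
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        pre.append(z)
        h = np.maximum(z, 0.0) if i < len(params) - 1 else z
        acts.append(h)
    delta = 2.0 * (h - TARGET) / x.shape[0]  # gradient of the mean squared risk
    grads = []
    for i in reversed(range(len(params))):
        W, _ = params[i]
        grads.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            delta = (delta @ W.T) * (pre[i - 1] > 0)  # generalized ReLU gradient
    grads.reverse()
    return [(W - lr * gW, b - lr * gb)
            for (W, b), (gW, gb) in zip(params, grads)]

x_eval = rng.uniform(-1.0, 1.0, size=(256, dims[0]))
initial = risk(params, x_eval)
for n in range(2000):
    lr = 0.05 / (n + 1) ** 0.6  # decaying but not L^1-summable
    batch = rng.uniform(-1.0, 1.0, size=(16, dims[0]))
    params = sgd_step(params, batch, lr)
final = risk(params, x_eval)
print(initial, final)
```

In this sketch the risk drops toward zero as the step count grows, consistent with the flavor of the result; the paper's theorem concerns the expectation of the risk and requires none of the particular constants used here.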
Related papers
- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
It remains an open problem of research to explain the success and the limitations of SGD methods in rigorous theoretical terms.
In this work we prove for a large class of SGD methods that the considered optimization method does, with high probability, not converge to global minimizers of the optimization problem.
The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
arXiv Detail & Related papers (2024-10-14T14:11:37Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
PINNs become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Comparative Analysis of Interval Reachability for Robust Implicit and
Feedforward Neural Networks [64.23331120621118]
We use interval reachability analysis to obtain robustness guarantees for implicit neural networks (INNs).
INNs are a class of implicit learning models that use implicit equations as layers.
We show that our approach performs at least as well as, and generally better than, applying state-of-the-art interval bound propagation methods to INNs.
arXiv Detail & Related papers (2022-04-01T03:31:27Z) - Existence, uniqueness, and convergence rates for gradient flows in the
training of artificial neural networks with ReLU activation [2.4087148947930634]
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure.
To this day, the scientific literature in general contains no mathematical convergence analysis which explains the numerical success of GD type schemes in the training of ANNs with ReLU activation.
arXiv Detail & Related papers (2021-08-18T12:06:19Z) - A proof of convergence for the gradient descent optimization method with
random initializations in the training of neural networks with ReLU
activation for piecewise linear target functions [3.198144010381572]
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-08-10T12:01:37Z) - A proof of convergence for gradient descent in the training of
artificial neural networks for constant target functions [3.4792548480344254]
We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
arXiv Detail & Related papers (2021-02-19T13:33:03Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
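The averaging scheme in the last entry can be illustrated on a toy strongly convex problem (a sketch of Polyak-Ruppert averaging only, unrelated to the NTK analysis itself; the objective, step sizes, and noise level are illustrative): the running average of the SGD iterates settles near the minimizer more stably than the raw iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([1.0, -2.0])  # minimizer of 0.5 * ||theta - target||^2
theta = np.zeros(2)
avg = np.zeros(2)
for n in range(1, 5001):
    # Noisy gradient of the quadratic objective.
    grad = (theta - target) + rng.normal(scale=0.1, size=2)
    theta = theta - 0.5 / n ** 0.5 * grad  # SGD with decaying step size
    avg += (theta - avg) / n               # running Polyak-Ruppert average
err = float(np.linalg.norm(avg - target))
print(err)
```

The averaged iterate smooths out the gradient noise; this averaging step is the mechanism behind the "averaged stochastic gradient descent" in the entry above, although the optimal-rate analysis there is carried out in the neural tangent kernel regime.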
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.