A proof of convergence for stochastic gradient descent in the training
of artificial neural networks with ReLU activation for constant target
functions
- URL: http://arxiv.org/abs/2104.00277v1
- Date: Thu, 1 Apr 2021 06:28:30 GMT
- Title: A proof of convergence for stochastic gradient descent in the training
of artificial neural networks with ReLU activation for constant target
functions
- Authors: Arnulf Jentzen, Adrian Riekert
- Abstract summary: We study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation.
The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article we study the stochastic gradient descent (SGD) optimization
method in the training of fully-connected feedforward artificial neural
networks with ReLU activation. The main result of this work proves that the
risk of the SGD process converges to zero if the target function under
consideration is constant. In the established convergence result the considered
artificial neural networks consist of one input layer, one hidden layer, and
one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in
\mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer).
The learning rates of the SGD process are assumed to be sufficiently small and
the input data used in the SGD process to train the artificial neural networks
is assumed to be independent and identically distributed.
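The convergence result concerns a concrete, elementary training loop: plain SGD with a sufficiently small constant learning rate applied to a one-hidden-layer ReLU network ($d$ inputs, $H$ hidden neurons, one output) whose training inputs are i.i.d. and whose target values are constant. The sketch below only illustrates that setting; it is not the authors' code, and the constant target value, the input distribution, and all hyperparameters are assumptions chosen for the example.

```python
# Minimal sketch (not the authors' code) of the setting analyzed in the paper:
# a fully-connected network with d inputs, H hidden ReLU neurons, and one output,
# trained by plain SGD with a small constant learning rate on i.i.d. inputs whose
# target values are constant. The constant c, the input distribution, and all
# hyperparameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, H = 4, 64          # input dimension and hidden width
c = 1.0               # constant target function f(x) = c
eta = 1e-3            # sufficiently small learning rate
steps = 20_000

# Network parameters: hidden layer (W, b) and output layer (v, b_out).
W = rng.normal(scale=1.0 / np.sqrt(d), size=(H, d))
b = np.zeros(H)
v = rng.normal(scale=1.0 / np.sqrt(H), size=H)
b_out = 0.0

def forward(x):
    """One-hidden-layer ReLU network: R^d -> R."""
    z = W @ x + b          # pre-activations of the hidden layer
    a = np.maximum(z, 0.0) # ReLU
    return a, z, v @ a + b_out

for t in range(steps):
    x = rng.uniform(-1.0, 1.0, size=d)  # i.i.d. training input (assumed distribution)
    a, z, y_hat = forward(x)
    err = y_hat - c                     # residual of the squared loss (y_hat - c)^2 / 2

    # Gradients of the per-sample squared loss.
    grad_v = err * a
    grad_b_out = err
    relu_grad = (z > 0.0).astype(float)
    grad_hidden = err * v * relu_grad   # backpropagation through the ReLU
    grad_W = np.outer(grad_hidden, x)
    grad_b = grad_hidden

    # Plain SGD step.
    v -= eta * grad_v
    b_out -= eta * grad_b_out
    W -= eta * grad_W
    b -= eta * grad_b

# Monte Carlo estimate of the risk E[(network(x) - c)^2].
xs = rng.uniform(-1.0, 1.0, size=(10_000, d))
preds = np.maximum(xs @ W.T + b, 0.0) @ v + b_out
print("estimated risk:", np.mean((preds - c) ** 2))
```

In this setting the estimated risk should shrink toward zero as the number of SGD steps grows, matching the qualitative statement of the main result.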
Related papers
- Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks [18.142378139047977]
This paper proposes a fractional-order spike-timing-dependent gradient descent (FOSTDGD) learning model.
It is tested on the MNIST and DVS128 Gesture datasets, and its accuracy under different network structures and fractional orders is analyzed.
arXiv Detail & Related papers (2024-10-20T05:31:34Z) - Efficient SGD Neural Network Training via Sublinear Activated Neuron
Identification [22.361338848134025]
We present a fully connected two-layer neural network for shifted ReLU activation to enable activated neuron identification in sublinear time via geometric search.
We also prove that our algorithm can converge in $O(M^2/\epsilon^2)$ time with network size quadratic in the coefficient norm upper bound $M$ and error term $\epsilon$ (a minimal illustration of the activated-neuron identification step appears after this list).
arXiv Detail & Related papers (2023-07-13T05:33:44Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Learning with Local Gradients at the Edge [14.94491070863641]
We present a novel backpropagation-free optimization algorithm dubbed Target Projection Stochastic Gradient Descent (tpSGD).
tpSGD generalizes direct random target projection to work with arbitrary loss functions.
We evaluate the performance of tpSGD in training deep neural networks and extend the approach to multi-layer RNNs.
arXiv Detail & Related papers (2022-08-17T19:51:06Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs).
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - On the Convergence of Shallow Neural Network Training with Randomly
Masked Neurons [11.119895959906085]
Given a dense shallow neural network, we focus on creating, training, and combining randomly selected functions.
By analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error.
For fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps.
arXiv Detail & Related papers (2021-12-05T19:51:14Z) - And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z) - Exploiting Heterogeneity in Operational Neural Networks by Synaptic
Plasticity [87.32169414230822]
The recently proposed network model, Operational Neural Networks (ONNs), can generalize conventional Convolutional Neural Networks (CNNs).
In this study, the focus is on searching for the best-possible operator set(s) for the hidden neurons of the network based on the Synaptic Plasticity paradigm, which constitutes the essential learning theory in biological neurons.
Experimental results over highly challenging problems demonstrate that elite ONNs, even with few neurons and layers, can achieve superior learning performance compared to GIS-based ONNs.
arXiv Detail & Related papers (2020-08-21T19:03:23Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
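On the sublinear activated-neuron identification entry above: for a two-layer network with a shifted ReLU $\sigma_b(t) = \max(t - b, 0)$, the set identified for an input $x$ consists of the hidden neurons whose pre-activation $\langle w_r, x \rangle$ exceeds the shift $b$. The sketch below is a plain linear scan that only illustrates this definition; the paper's actual contribution, maintaining that set in sublinear time via geometric search, is not reproduced here, and all parameter values are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm) of "activated neuron identification"
# for a two-layer network with shifted ReLU sigma_b(t) = max(t - b, 0): for an
# input x, the activated neurons are those hidden units whose pre-activation
# exceeds the shift b. The paper performs this identification in sublinear time
# via geometric search structures; the linear scan below only illustrates the
# quantity being computed. All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

d, m = 8, 1024        # input dimension and number of hidden neurons
b = 0.5               # shift of the shifted ReLU
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m)  # output-layer weights

def activated_neurons(x):
    """Indices of hidden neurons with <w_r, x> > b (the shifted-ReLU 'fire set')."""
    return np.flatnonzero(W @ x > b)

def forward(x):
    """Two-layer shifted-ReLU network, summing only over activated neurons."""
    idx = activated_neurons(x)
    return a[idx] @ (W[idx] @ x - b)

x = rng.normal(size=d)
idx = activated_neurons(x)
print(f"{idx.size} of {m} neurons activated; output = {forward(x):.4f}")
```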
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.