Almost Sure Convergence of Dropout Algorithms for Neural Networks
- URL: http://arxiv.org/abs/2002.02247v2
- Date: Thu, 23 Mar 2023 15:13:14 GMT
- Title: Almost Sure Convergence of Dropout Algorithms for Neural Networks
- Authors: Albert Senen-Cerda, Jaron Sanders
- Abstract summary: We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton et al., 2012).
This paper presents a probability theoretical proof that for fully-connected NNs with differentiable, polynomially bounded activation functions, if the weights are projected onto a compact set when using a dropout algorithm, then they converge to a unique stationary point of a projected system of Ordinary Differential Equations (ODEs).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the convergence and convergence rate of stochastic training
algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton
et al., 2012). With the goal of avoiding overfitting during training of NNs,
dropout algorithms consist in practice of multiplying the weight matrices of a
NN componentwise by independently drawn random matrices with $\{0, 1 \}$-valued
entries during each iteration of Stochastic Gradient Descent (SGD). This paper
presents a probability theoretical proof that for fully-connected NNs with
differentiable, polynomially bounded activation functions, if we project the
weights onto a compact set when using a dropout algorithm, then the weights of
the NN converge to a unique stationary point of a projected system of Ordinary
Differential Equations (ODEs). After this general convergence guarantee, we go
on to investigate the convergence rate of dropout. Firstly, we obtain generic
sample complexity bounds for finding $\epsilon$-stationary points of smooth
nonconvex functions using SGD with dropout that explicitly depend on the
dropout probability. Secondly, we obtain an upper bound on the rate of
convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms
for NNs with the shape of arborescences of arbitrary depth and with linear
activation functions. The latter bound shows that for an algorithm such as
Dropout or Dropconnect (Wan et al., 2013), the convergence rate can be impaired
exponentially by the depth of the arborescence. In contrast, we experimentally
observe no such dependence for wide NNs with just a few dropout layers. We also
provide a heuristic argument for this observation. Our results suggest that
there is a change of scale of the effect of the dropout probability in the
convergence rate that depends on the relative size of the width of the NN
compared to its depth.
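The abstract describes the algorithm concretely: at every SGD iteration the weight matrices are multiplied componentwise by independently drawn $\{0, 1\}$-valued random matrices, and the weights are then projected onto a compact set. The Python sketch below is one possible illustrative reading of that description, not the authors' implementation; the toy data, the tanh activation, the keep-probability `p`, the step size, and the choice of a Euclidean ball as the compact set are all assumptions made for the example.

```python
# Illustrative sketch of projected SGD with Dropconnect-style weight masking,
# following the description in the abstract (assumptions: toy data, tanh
# activation, ball projection; not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration).
X = rng.normal(size=(256, 4))
y = np.tanh(X @ rng.normal(size=(4, 1)))

# Two-layer fully-connected NN with a differentiable, polynomially bounded activation (tanh).
W1 = rng.normal(scale=0.5, size=(4, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))

p = 0.8        # probability that a weight entry is kept by the mask
lr = 1e-2      # SGD step size
radius = 10.0  # radius of the compact set onto which the weights are projected

def project(W, r):
    """Project W onto the Frobenius-norm ball of radius r (one choice of compact set)."""
    n = np.linalg.norm(W)
    return W if n <= r else W * (r / n)

for step in range(2000):
    idx = rng.integers(0, len(X), size=32)  # mini-batch for SGD
    xb, yb = X[idx], y[idx]

    # Independently drawn {0,1}-valued masks, multiplied componentwise with the
    # weight matrices (Dropconnect-style; masking whole rows/columns would give Dropout).
    F1 = rng.binomial(1, p, size=W1.shape)
    F2 = rng.binomial(1, p, size=W2.shape)
    M1, M2 = W1 * F1, W2 * F2

    # Forward pass through the masked network.
    h = np.tanh(xb @ M1)
    out = h @ M2
    err = out - yb  # gradient of 0.5 * MSE w.r.t. the output

    # Backward pass; gradients w.r.t. the unmasked weights pick up the same masks.
    gM2 = h.T @ err / len(xb)
    gh = err @ M2.T * (1.0 - h**2)
    gM1 = xb.T @ gh / len(xb)

    # SGD step followed by projection onto the compact set.
    W1 = project(W1 - lr * gM1 * F1, radius)
    W2 = project(W2 - lr * gM2 * F2, radius)

# Loss of the unmasked network (the common inference-time rescaling by p is omitted here).
print("final loss:", 0.5 * np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
```

In this sketch the masks act on individual weight entries, which corresponds to a Dropconnect-style scheme; zeroing entire rows or columns of the masks instead would recover node-level Dropout, and the projection step is the ingredient the convergence guarantee in the abstract relies on.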
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, of constant depth and increasing width.
The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Neural Optimization Kernel: Towards Robust Deep Learning [13.147925376013129]
Recent studies show a connection between neural networks (NN) and kernel methods.
This paper proposes a novel kernel family named Neural Optimization Kernel (NOK).
We show that an over-parameterized deep NN (NOK) can increase the expressive power to reduce empirical risk and reduce the generalization bound at the same time.
arXiv Detail & Related papers (2021-06-11T00:34:55Z) - LocalDrop: A Hybrid Regularization for Deep Neural Networks [98.30782118441158]
We propose a new approach for the regularization of neural networks by the local Rademacher complexity called LocalDrop.
A new regularization function for both fully-connected networks (FCNs) and convolutional neural networks (CNNs) has been developed based on the proposed upper bound of the local Rademacher complexity.
arXiv Detail & Related papers (2021-03-01T03:10:11Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze the convergence on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z) - Random Vector Functional Link Networks for Function Approximation on Manifolds [8.535815777849786]
Single-layer neural networks with random input-to-hidden layer weights and biases have seen success in practice.
We further adapt this randomized neural network architecture to approximate functions on smooth, compact submanifolds of Euclidean space.
arXiv Detail & Related papers (2020-07-30T23:50:44Z) - Quantitative Propagation of Chaos for SGD in Wide Neural Networks [39.35545193410871]
In this paper, we investigate the limiting behavior of a continuous-time counterpart of Stochastic Gradient Descent (SGD).
We show 'propagation of chaos' for the particle system defined by this continuous-time dynamics under different scenarios.
We identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand.
arXiv Detail & Related papers (2020-07-13T12:55:21Z) - Multipole Graph Neural Operator for Parametric Partial Differential
Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)