Almost Sure Convergence of Dropout Algorithms for Neural Networks
- URL: http://arxiv.org/abs/2002.02247v2
- Date: Thu, 23 Mar 2023 15:13:14 GMT
- Title: Almost Sure Convergence of Dropout Algorithms for Neural Networks
- Authors: Albert Senen-Cerda, Jaron Sanders
- Abstract summary: We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton et al., 2012).
This paper presents a probability theoretical proof that for fully-connected NNs with differentiable, polynomially bounded activation functions, if the weights are projected onto a compact set when using a dropout algorithm, then they converge to a unique stationary point of a projected system of Ordinary Differential Equations (ODEs).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the convergence and convergence rate of stochastic training
algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton
et al., 2012). With the goal of avoiding overfitting during training of NNs,
dropout algorithms consist in practice of multiplying the weight matrices of a
NN componentwise by independently drawn random matrices with $\{0, 1 \}$-valued
entries during each iteration of Stochastic Gradient Descent (SGD). This paper
presents a probability theoretical proof that for fully-connected NNs with
differentiable, polynomially bounded activation functions, if we project the
weights onto a compact set when using a dropout algorithm, then the weights of
the NN converge to a unique stationary point of a projected system of Ordinary
Differential Equations (ODEs). After this general convergence guarantee, we go
on to investigate the convergence rate of dropout. Firstly, we obtain generic
sample complexity bounds for finding $\epsilon$-stationary points of smooth
nonconvex functions using SGD with dropout that explicitly depend on the
dropout probability. Secondly, we obtain an upper bound on the rate of
convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms
for NNs with the shape of arborescences of arbitrary depth and with linear
activation functions. The latter bound shows that for an algorithm such as
Dropout or Dropconnect (Wan et al., 2013), the convergence rate can be impaired
exponentially by the depth of the arborescence. In contrast, we experimentally
observe no such dependence for wide NNs with just a few dropout layers. We also
provide a heuristic argument for this observation. Our results suggest that
there is a change of scale of the effect of the dropout probability in the
convergence rate that depends on the relative size of the width of the NN
compared to its depth.
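The abstract describes the algorithm concretely: at every SGD iteration the weight matrices are multiplied componentwise by independently drawn $\{0, 1\}$-valued random matrices, and the weights are then projected onto a compact set. The Python sketch below is one possible illustrative reading of that description, not the authors' implementation; the toy data, the tanh activation, the keep-probability `p`, the step size, and the choice of a Euclidean ball as the compact set are all assumptions made for the example.

```python
# Illustrative sketch of projected SGD with Dropconnect-style weight masking,
# following the description in the abstract (assumptions: toy data, tanh
# activation, ball projection; not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration).
X = rng.normal(size=(256, 4))
y = np.tanh(X @ rng.normal(size=(4, 1)))

# Two-layer fully-connected NN with a differentiable, polynomially bounded activation (tanh).
W1 = rng.normal(scale=0.5, size=(4, 16))
W2 = rng.normal(scale=0.5, size=(16, 1))

p = 0.8        # probability that a weight entry is kept by the mask
lr = 1e-2      # SGD step size
radius = 10.0  # radius of the compact set onto which the weights are projected

def project(W, r):
    """Project W onto the Frobenius-norm ball of radius r (one choice of compact set)."""
    n = np.linalg.norm(W)
    return W if n <= r else W * (r / n)

for step in range(2000):
    idx = rng.integers(0, len(X), size=32)  # mini-batch for SGD
    xb, yb = X[idx], y[idx]

    # Independently drawn {0,1}-valued masks, multiplied componentwise with the
    # weight matrices (Dropconnect-style; masking whole rows/columns would give Dropout).
    F1 = rng.binomial(1, p, size=W1.shape)
    F2 = rng.binomial(1, p, size=W2.shape)
    M1, M2 = W1 * F1, W2 * F2

    # Forward pass through the masked network.
    h = np.tanh(xb @ M1)
    out = h @ M2
    err = out - yb  # gradient of 0.5 * MSE w.r.t. the output

    # Backward pass; gradients w.r.t. the unmasked weights pick up the same masks.
    gM2 = h.T @ err / len(xb)
    gh = err @ M2.T * (1.0 - h**2)
    gM1 = xb.T @ gh / len(xb)

    # SGD step followed by projection onto the compact set.
    W1 = project(W1 - lr * gM1 * F1, radius)
    W2 = project(W2 - lr * gM2 * F2, radius)

# Loss of the unmasked network (the common inference-time rescaling by p is omitted here).
print("final loss:", 0.5 * np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
```

In this sketch the masks act on individual weight entries, which corresponds to a Dropconnect-style scheme; zeroing entire rows or columns of the masks instead would recover node-level Dropout, and the projection step is the ingredient the convergence guarantee in the abstract relies on.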
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, of constant depth and increasing width.
The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Neural Optimization Kernel: Towards Robust Deep Learning [13.147925376013129]
Recent studies show a connection between neural networks (NN) and kernel methods.
This paper proposes a novel kernel family named Neural Optimization Kernel (NOK).
We show that an over-parameterized deep NN (NOK) can increase the expressive power to reduce empirical risk and reduce the generalization bound at the same time.
arXiv Detail & Related papers (2021-06-11T00:34:55Z) - LocalDrop: A Hybrid Regularization for Deep Neural Networks [98.30782118441158]
We propose a new approach for the regularization of neural networks by the local Rademacher complexity called LocalDrop.
A new regularization function for both fully-connected networks (FCNs) and convolutional neural networks (CNNs) has been developed based on the proposed upper bound of the local Rademacher complexity.
arXiv Detail & Related papers (2021-03-01T03:10:11Z) - Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze the convergence on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the NN.
arXiv Detail & Related papers (2020-12-01T19:02:37Z) - Random Vector Functional Link Networks for Function Approximation on Manifolds [8.535815777849786]
Single-layer neural networks with random input-to-hidden layer weights and biases have seen success in practice.
We further adapt this randomized neural network architecture to approximate functions on smooth, compact submanifolds of Euclidean space.
arXiv Detail & Related papers (2020-07-30T23:50:44Z) - Quantitative Propagation of Chaos for SGD in Wide Neural Networks [39.35545193410871]
In this paper, we investigate the limiting behavior of a continuous-time counterpart of Stochastic Gradient Descent (SGD).
We show 'propagation of chaos' for the particle system defined by this continuous-time dynamics under different scenarios.
We identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand.
arXiv Detail & Related papers (2020-07-13T12:55:21Z) - Multipole Graph Neural Operator for Parametric Partial Differential
Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)