Asymptotic convergence rate of Dropout on shallow linear neural networks
- URL: http://arxiv.org/abs/2012.01978v1
- Date: Tue, 1 Dec 2020 19:02:37 GMT
- Title: Asymptotic convergence rate of Dropout on shallow linear neural networks
- Authors: Albert Senen-Cerda, Jaron Sanders
- Abstract summary: We analyze the convergence of gradient flows on objective functions induced by Dropout and Dropconnect when applied to shallow linear Neural Networks.
We obtain a local convergence proof of the gradient flow and a bound on the convergence rate that depends on the data, the dropout probability, and the width of the NN.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the convergence rate of gradient flows on objective functions
induced by Dropout and Dropconnect, when applying them to shallow linear Neural
Networks (NNs) - which can also be viewed as doing matrix factorization using a
particular regularizer. Dropout algorithms such as these are thus
regularization techniques that use {0,1}-valued random variables to filter
weights during training in order to avoid coadaptation of features. By
leveraging a recent result on nonconvex optimization and conducting a careful
analysis of the set of minimizers as well as the Hessian of the loss function,
we are able to obtain (i) a local convergence proof of the gradient flow and
(ii) a bound on the convergence rate that depends on the data, the dropout
probability, and the width of the NN. Finally, we compare this theoretical
bound to numerical simulations, which are in qualitative agreement with the
convergence bound and match it when starting sufficiently close to a minimizer.
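To make the setup concrete, here is a minimal sketch (not the authors' code or experimental setup) of the kind of training the abstract describes: a two-layer linear network y ≈ W2 W1 x trained by gradient descent while independent Bernoulli {0,1} masks filter individual weights at each step, Dropconnect-style. The dimensions, keep probability p, learning rate, synthetic data, and the 1/p rescaling are illustrative assumptions; the paper itself analyzes the gradient flow on the expected (Dropout-induced) objective rather than this stochastic recursion.
```python
# Minimal sketch, assuming a shallow linear NN trained with Dropconnect-style
# weight masking; all hyperparameters and data below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out, n = 5, 8, 3, 200    # assumed dimensions
p = 0.8                                    # keep probability (1 - dropout rate)
lr, steps = 1e-2, 2000

X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X  # linear teacher data (assumption)

W1 = 0.1 * rng.standard_normal((d_hidden, d_in))
W2 = 0.1 * rng.standard_normal((d_out, d_hidden))

for t in range(steps):
    # {0,1}-valued Bernoulli(p) masks filter individual weights (Dropconnect).
    M1 = rng.binomial(1, p, size=W1.shape)
    M2 = rng.binomial(1, p, size=W2.shape)
    W1m, W2m = M1 * W1, M2 * W2

    # Squared loss of the masked network; the 1/p rescaling keeps the masked
    # forward pass unbiased in expectation (an assumed convention here).
    H = (W1m / p) @ X
    R = (W2m / p) @ H - Y
    loss = 0.5 * np.mean(np.sum(R**2, axis=0))

    # Gradients flow only through the kept weights (chain rule with the masks).
    gW2 = (R @ H.T) / n * (M2 / p)
    gW1 = (((W2m / p).T @ R) @ X.T) / n * (M1 / p)
    W2 -= lr * gW2
    W1 -= lr * gW1

print("final masked-objective estimate:", loss)
```
Averaging this objective over the random masks yields a deterministic loss on the factors W1, W2, which is the matrix-factorization-with-regularizer viewpoint mentioned in the abstract.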
Related papers
- Non-asymptotic convergence analysis of the stochastic gradient
Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with
applications to training of ReLU neural networks [8.058385158111207]
We provide a non-asymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo algorithm to a target measure in Wasserstein-1 and Wasserstein-2 distance.
To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - On Asymptotic Linear Convergence of Projected Gradient Descent for
Constrained Least Squares [22.851500417035947]
This manuscript presents a unified framework for the analysis of projected gradient descent in the context of constrained least squares.
We present a recipe for the convergence analysis of PGD and demonstrate it via a beginning-to-end application of the recipe on four fundamental problems.
arXiv Detail & Related papers (2021-12-22T09:49:51Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that Neural Network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable
Neural Distribution Alignment [52.02794488304448]
We propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows.
We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
arXiv Detail & Related papers (2020-03-26T22:10:04Z) - Nonconvex sparse regularization for deep neural networks and its
optimality [1.9798034349981162]
Deep neural network (DNN) estimators can attain optimal convergence rates for regression and classification problems.
We propose a novel penalized estimation method for sparse DNNs.
We prove that the sparse-penalized estimator can adaptively attain minimax convergence rates for various nonparametric regression problems.
arXiv Detail & Related papers (2020-03-26T07:15:28Z) - Almost Sure Convergence of Dropout Algorithms for Neural Networks [0.0]
We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton et al., 2012).
This paper presents a probability theoretical proof that, for fully-connected NNs with differentiable, polynomially bounded activation functions, these Dropout-inspired algorithms converge almost surely.
arXiv Detail & Related papers (2020-02-06T13:25:26Z)