Global Convergence of SGD On Two Layer Neural Nets
- URL: http://arxiv.org/abs/2210.11452v2
- Date: Sat, 8 Apr 2023 14:33:08 GMT
- Title: Global Convergence of SGD On Two Layer Neural Nets
- Authors: Pulkit Gopalani and Anirbit Mukherjee
- Abstract summary: We show provable convergence of SGD to the global minima of appropriately regularized $\ell_2$-empirical risk of depth $2$ nets.
We leverage a constant amount of Frobenius regularization on the weights, along with sampling of the initial weights from an appropriate distribution.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this note we demonstrate provable convergence of SGD to the global minima
of appropriately regularized $\ell_2-$empirical risk of depth $2$ nets -- for
arbitrary data and with any number of gates, if they are using adequately
smooth and bounded activations like sigmoid and tanh. We build on the results
in [1] and leverage a constant amount of Frobenius norm regularization on the
weights, along with sampling of the initial weights from an appropriate
distribution. We also give a continuous time SGD convergence result that also
applies to smooth unbounded activations like SoftPlus. Our key idea is to show
the existence of loss functions on constant sized neural nets which are "Villani
Functions". [1] Bin Shi, Weijie J. Su, and Michael I. Jordan. On learning rates
and Schr\"odinger operators, 2020. arXiv:2004.06977
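
As a concrete illustration of the setting in the abstract, the sketch below runs SGD on a Frobenius-norm regularized squared-error ("$\ell_2$") empirical risk of a depth-$2$ tanh network, with the initial weights sampled from a Gaussian. It is a minimal sketch under placeholder assumptions: the width, regularization constant, step size, and initialization distribution are illustrative choices, not the specific ones the paper's theorems require.

```python
import numpy as np

# Minimal sketch (not the paper's exact construction): SGD on the
# Frobenius-norm regularized squared-error risk of a depth-2 tanh net.
# The width p, regularization constant lam, Gaussian initialization and
# step size eta are illustrative placeholders.

rng = np.random.default_rng(0)
n, d, p = 200, 10, 32                 # samples, input dimension, number of gates
X = rng.normal(size=(n, d))           # arbitrary data, as in the paper's setting
y = rng.normal(size=n)

W = rng.normal(size=(p, d))           # inner-layer weights at initialization
a = rng.normal(size=p)                # outer-layer weights at initialization
lam, eta = 0.1, 1e-2                  # regularization strength, step size

def regularized_risk(W, a):
    h = np.tanh(X @ W.T)              # hidden activations, shape (n, p)
    residual = h @ a - y
    return np.mean(residual ** 2) + lam * (np.sum(W ** 2) + np.sum(a ** 2))

for step in range(5000):
    i = rng.integers(n)               # single-sample stochastic gradient
    h = np.tanh(W @ X[i])             # shape (p,)
    err = h @ a - y[i]
    grad_a = 2 * err * h + 2 * lam * a
    grad_W = 2 * err * np.outer(a * (1 - h ** 2), X[i]) + 2 * lam * W
    a -= eta * grad_a
    W -= eta * grad_W

print("final regularized empirical risk:", regularized_risk(W, a))
```

The snippet only illustrates the objective and the update rule; the paper's global-convergence guarantee additionally requires the activation, the amount of regularization, and the initialization distribution to be chosen as its theorems prescribe.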
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z) - Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets [0.20482269513546453]
We show a first-of-its-kind convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets.
The key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets (a hedged sketch of such an objective appears after this list).
arXiv Detail & Related papers (2023-09-17T12:44:07Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, L-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Stability and Generalization of lp-Regularized Stochastic Learning for
GCN [9.517209629978057]
Graph convolutional networks (GCN) are viewed as one of the most popular representations among the variants of graph neural networks over graph data.
This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general $\ell_p$-regularized $(1 \leq p \leq 2)$ learning algorithm.
arXiv Detail & Related papers (2023-05-20T03:49:29Z) - From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
The overarching contribution of this paper is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Feature selection with gradient descent on two-layer networks in
low-rotation regimes [20.41989568533313]
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks.
It makes use of margins as the core analytic technique.
arXiv Detail & Related papers (2022-08-04T17:43:36Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - On the Global Convergence of Training Deep Linear ResNets [104.76256863926629]
We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets).
We prove that for training deep residual networks with certain linear transformations at input and output layers, both GD and SGD can converge to the global minimum of the training loss.
arXiv Detail & Related papers (2020-03-02T18:34:49Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
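
For the related entry "Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets" above, a hedged sketch of the kind of objective it refers to (in illustrative notation, not the paper's own) is the Frobenius-norm regularized logistic empirical risk of a depth-$2$ net,
$$ \widehat{L}_{\lambda}(\mathbf{W}, \mathbf{a}) \;=\; \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i \, f_{\mathbf{W},\mathbf{a}}(\mathbf{x}_i)}\bigr) \;+\; \frac{\lambda}{2}\, \lVert \mathbf{W} \rVert_F^2, \qquad f_{\mathbf{W},\mathbf{a}}(\mathbf{x}) = \sum_{j=1}^{p} a_j\, \sigma(\langle \mathbf{w}_j, \mathbf{x} \rangle), $$
where $\sigma$ is a smooth bounded activation such as sigmoid, $y_i \in \{-1, +1\}$ are binary labels, $p$ is the number of gates, and $\lambda > 0$ is the regularization constant.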