Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
- URL: http://arxiv.org/abs/2309.09258v2
- Date: Sun, 17 Mar 2024 21:22:05 GMT
- Title: Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
- Authors: Pulkit Gopalani, Samyak Jha, Anirbit Mukherjee
- Abstract summary: We show a first-of-its-kind convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets.
Key idea is to show that Frobenius norm regularized logistic loss functions on constant-sized neural nets are "Villani functions".
- Score: 0.20482269513546453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions" and thus be able to build on recent progress with analyzing SGD on such objectives.
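As a concrete illustration of the setting in the abstract, below is a minimal NumPy sketch of single-sample SGD on the Frobenius-norm regularized logistic empirical risk of a depth-2 sigmoid net. It is only a toy instance of the objective class the paper studies, not the paper's algorithm or analysis; the data, width, step size, and regularization weight are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data with labels in {-1, +1} (illustrative only).
n, d, width = 200, 10, 32            # samples, input dimension, number of gates
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

# Depth-2 net: f(x) = a^T sigmoid(W x).
W = rng.normal(scale=0.1, size=(width, d))
a = rng.normal(scale=0.1, size=width)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lam, lr, steps = 0.1, 0.05, 5000     # regularization weight and step size are placeholders

for t in range(steps):
    i = rng.integers(n)              # single-sample SGD
    x, yi = X[i], y[i]
    h = sigmoid(W @ x)               # hidden-layer activations
    f = a @ h                        # network output
    # Per-sample objective: log(1 + exp(-y_i f(x_i))) + (lam/2) * (||W||_F^2 + ||a||^2)
    g = -yi * sigmoid(-yi * f)       # derivative of the logistic loss w.r.t. f
    grad_a = g * h + lam * a
    grad_W = np.outer(g * a * h * (1 - h), x) + lam * W
    a -= lr * grad_a
    W -= lr * grad_W
```

The regularization gradients (lam * a, lam * W) correspond to the Frobenius-norm penalty on both layers; swapping sigmoid for tanh keeps the sketch within the class of smooth, bounded activations the abstract mentions.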
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
- On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks (a minimal sampling sketch appears after this list).
arXiv Detail & Related papers (2023-12-26T18:06:48Z)
- Generalization Guarantees of Gradient Descent for Multi-Layer Neural Networks [55.86300309474023]
We conduct a comprehensive stability and generalization analysis of gradient descent (GD) for multi-layer NNs.
We derive the excess risk rate of $O(1/\sqrt{n})$ for GD algorithms in both two-layer and three-layer NNs.
arXiv Detail & Related papers (2023-05-26T12:51:38Z)
- Stability and Generalization of $\ell_p$-Regularized Stochastic Learning for GCN [9.517209629978057]
Graph convolutional networks (GCN) are viewed as one of the most popular representations among the variants of graph neural networks over graph data.
This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general $\ell_p$-regularized $(1 < p \leq 2)$ learning algorithm.
arXiv Detail & Related papers (2023-05-20T03:49:29Z)
- Global Convergence of SGD On Two Layer Neural Nets [0.2302001830524133]
We consider appropriately regularized $\ell_2-$empirical risk of depth $2$ nets with any number of gates.
We show bounds on how the empirical loss evolves for SGD iterates on it -- for arbitrary data and if the activation is adequately smooth and bounded like sigmoid and tanh.
arXiv Detail & Related papers (2022-10-20T17:50:46Z)
- From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
The overarching contribution of this paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives (a sketch of one such normalized-SGD-with-momentum update appears after this list).
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z)
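To make the distinction in the "SGD Without Replacement" entry above concrete, here is a minimal sketch contrasting the without-replacement (shuffle once per epoch) variant against the with-replacement sampling usually assumed in theory. The loop bodies are placeholders: `params`, `data`, `grad_fn`, and `lr` are assumed to be supplied by the caller and are not tied to any specific paper.

```python
import numpy as np

def sgd_epoch_without_replacement(params, data, grad_fn, lr, rng):
    """One epoch of SGD without replacement: every sample is visited exactly
    once, in a freshly shuffled order (the variant typically used in practice)."""
    order = rng.permutation(len(data))
    for i in order:
        params = params - lr * grad_fn(params, data[i])
    return params

def sgd_steps_with_replacement(params, data, grad_fn, lr, rng, steps):
    """The i.i.d. variant usually analyzed in theory: each step draws an index
    uniformly at random, so a sample may repeat within an 'epoch'."""
    for _ in range(steps):
        i = rng.integers(len(data))
        params = params - lr * grad_fn(params, data[i])
    return params
```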
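The "Momentum Improves Normalized SGD" entry refers to normalized SGD with a momentum buffer. One common form of that update is sketched below; it is only an assumed, illustrative variant (not necessarily the exact algorithm analyzed in that paper), and the hyperparameters are placeholders.

```python
import numpy as np

def normalized_sgd_momentum(w, grad_fn, lr=0.01, beta=0.9, steps=1000, eps=1e-12):
    """Normalized SGD with momentum: average stochastic gradients into a
    momentum buffer, then step along the unit-normalized buffer direction."""
    m = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                               # stochastic gradient at w
        m = beta * m + (1.0 - beta) * g              # exponential moving average
        w = w - lr * m / (np.linalg.norm(m) + eps)   # normalized step
    return w
```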