Global Convergence of SGD On Two Layer Neural Nets
- URL: http://arxiv.org/abs/2210.11452v2
- Date: Sat, 8 Apr 2023 14:33:08 GMT
- Title: Global Convergence of SGD On Two Layer Neural Nets
- Authors: Pulkit Gopalani and Anirbit Mukherjee
- Abstract summary: We show provable convergence of SGD to the global minima of appropriately regularized $\ell_2$-empirical risk of depth $2$ nets.
We leverage a constant amount of Frobenius regularization on the weights, along with sampling of the initial weights from an appropriate distribution.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this note we demonstrate provable convergence of SGD to the global minima
of appropriately regularized $\ell_2-$empirical risk of depth $2$ nets -- for
arbitrary data and with any number of gates, if they are using adequately
smooth and bounded activations like sigmoid and tanh. We build on the results
in [1] and leverage a constant amount of Frobenius norm regularization on the
weights, along with sampling of the initial weights from an appropriate
distribution. We also give a continuous time SGD convergence result that also
applies to smooth unbounded activations like SoftPlus. Our key idea is to show
the existence of loss functions on constant sized neural nets which are "Villani
Functions". [1] Bin Shi, Weijie J. Su, and Michael I. Jordan. On learning rates
and Schr\"odinger operators, 2020. arXiv:2004.06977
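
As a concrete illustration of the setting in the abstract, the sketch below runs SGD on a Frobenius-norm regularized squared-error ("$\ell_2$") empirical risk of a depth-$2$ tanh network, with the initial weights sampled from a Gaussian. It is a minimal sketch under placeholder assumptions: the width, regularization constant, step size, and initialization distribution are illustrative choices, not the specific ones the paper's theorems require.

```python
import numpy as np

# Minimal sketch (not the paper's exact construction): SGD on the
# Frobenius-norm regularized squared-error risk of a depth-2 tanh net.
# The width p, regularization constant lam, Gaussian initialization and
# step size eta are illustrative placeholders.

rng = np.random.default_rng(0)
n, d, p = 200, 10, 32                 # samples, input dimension, number of gates
X = rng.normal(size=(n, d))           # arbitrary data, as in the paper's setting
y = rng.normal(size=n)

W = rng.normal(size=(p, d))           # inner-layer weights at initialization
a = rng.normal(size=p)                # outer-layer weights at initialization
lam, eta = 0.1, 1e-2                  # regularization strength, step size

def regularized_risk(W, a):
    h = np.tanh(X @ W.T)              # hidden activations, shape (n, p)
    residual = h @ a - y
    return np.mean(residual ** 2) + lam * (np.sum(W ** 2) + np.sum(a ** 2))

for step in range(5000):
    i = rng.integers(n)               # single-sample stochastic gradient
    h = np.tanh(W @ X[i])             # shape (p,)
    err = h @ a - y[i]
    grad_a = 2 * err * h + 2 * lam * a
    grad_W = 2 * err * np.outer(a * (1 - h ** 2), X[i]) + 2 * lam * W
    a -= eta * grad_a
    W -= eta * grad_W

print("final regularized empirical risk:", regularized_risk(W, a))
```

The snippet only illustrates the objective and the update rule; the paper's global-convergence guarantee additionally requires the activation, the amount of regularization, and the initialization distribution to be chosen as its theorems prescribe.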
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z) - Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets [0.20482269513546453]
We show a first-of-its-kind convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets.
The key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets (a hedged sketch of such an objective appears after this list).
arXiv Detail & Related papers (2023-09-17T12:44:07Z) - Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, L-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Stability and Generalization of lp-Regularized Stochastic Learning for
GCN [9.517209629978057]
Graph convolutional networks (GCN) are viewed as one of the most popular representations among the variants of graph neural networks over graph data.
This paper aims to quantify the trade-off of GCN between smoothness and sparsity, with the help of a general $\ell_p$-regularized $(1 \leq p \leq 2)$ learning algorithm.
arXiv Detail & Related papers (2023-05-20T03:49:29Z) - From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
The overarching contribution of this paper is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Feature selection with gradient descent on two-layer networks in
low-rotation regimes [20.41989568533313]
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks.
It makes use of margins as the core analytic technique.
arXiv Detail & Related papers (2022-08-04T17:43:36Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - On the Global Convergence of Training Deep Linear ResNets [104.76256863926629]
We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets).
We prove that for training deep residual networks with certain linear transformations at input and output layers, both GD and SGD can converge to the global minimum of the training loss.
arXiv Detail & Related papers (2020-03-02T18:34:49Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
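
For the related entry "Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets" above, a hedged sketch of the kind of objective it refers to (in illustrative notation, not the paper's own) is the Frobenius-norm regularized logistic empirical risk of a depth-$2$ net,
$$ \widehat{L}_{\lambda}(\mathbf{W}, \mathbf{a}) \;=\; \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i \, f_{\mathbf{W},\mathbf{a}}(\mathbf{x}_i)}\bigr) \;+\; \frac{\lambda}{2}\, \lVert \mathbf{W} \rVert_F^2, \qquad f_{\mathbf{W},\mathbf{a}}(\mathbf{x}) = \sum_{j=1}^{p} a_j\, \sigma(\langle \mathbf{w}_j, \mathbf{x} \rangle), $$
where $\sigma$ is a smooth bounded activation such as sigmoid, $y_i \in \{-1, +1\}$ are binary labels, $p$ is the number of gates, and $\lambda > 0$ is the regularization constant.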