Fast Convergence in Learning Two-Layer Neural Networks with Separable
Data
- URL: http://arxiv.org/abs/2305.13471v2
- Date: Mon, 26 Jun 2023 19:19:31 GMT
- Title: Fast Convergence in Learning Two-Layer Neural Networks with Separable
Data
- Authors: Hossein Taheri, Christos Thrampoulidis
- Abstract summary: We study normalized gradient descent on two-layer neural nets.
We prove for exponentially-tailed losses that using normalized GD leads to linear rate of convergence of the training loss to the global optimum.
- Score: 37.908159361149835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalized gradient descent has shown substantial success in speeding up the
convergence of exponentially-tailed loss functions (which includes exponential
and logistic losses) on linear classifiers with separable data. In this paper,
we go beyond linear models by studying normalized GD on two-layer neural nets.
We prove for exponentially-tailed losses that using normalized GD leads to
linear rate of convergence of the training loss to the global optimum if the
iterates find an interpolating model. This is made possible by showing certain
gradient self-boundedness conditions and a log-Lipschitzness property. We also
study generalization of normalized GD for convex objectives via an
algorithmic-stability analysis. In particular, we show that normalized GD does
not overfit during training by establishing finite-time generalization bounds.
Related papers
- Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks [3.680127959836384]
First-order methods, such as gradient descent (GD) and quadratic descent gradient (SGD) have been proven effective in training neural networks.
However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix.
In this paper, we show that for the $L2$ regression problems, the learning rate can be improved from $mathcalO(1)$ to $mathcalO(1)$, which implies that GD actually enjoys a faster convergence rate.
arXiv Detail & Related papers (2024-08-01T14:06:34Z) - Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer
Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $exp(-Omega(log2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Benign Overfitting without Linearity: Neural Network Classifiers Trained
by Gradient Descent for Noisy Linear Data [44.431266188350655]
We consider the generalization error of two-layer neural networks trained to generalize by gradient descent.
We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error.
In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
arXiv Detail & Related papers (2022-02-11T23:04:00Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay globalization.
We show that if convex, and the weight decay regularization is employed, any optimization algorithms including Adam will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Universal scaling laws in the gradient descent training of neural
networks [10.508187462682308]
We show that the learning trajectory can be characterized by an explicit bounds at large training times.
Our results are based on spectral analysis of the evolution of a large network trained on the expected loss.
arXiv Detail & Related papers (2021-05-02T16:46:38Z) - Fast Learning of Graph Neural Networks with Guaranteed Generalizability:
One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice.
We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.