Escaping mediocrity: how two-layer networks learn hard generalized
linear models with SGD
- URL: http://arxiv.org/abs/2305.18502v2
- Date: Fri, 1 Mar 2024 15:45:05 GMT
- Title: Escaping mediocrity: how two-layer networks learn hard generalized
linear models with SGD
- Authors: Luca Arnaboldi, Florent Krzakala, Bruno Loureiro, Ludovic Stephan
- Abstract summary: This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD).
We show that overparameterization can only enhance convergence by a constant factor within this problem class.
Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
- Score: 29.162265194920522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study explores the sample complexity for two-layer neural networks to
learn a generalized linear target function under Stochastic Gradient Descent
(SGD), focusing on the challenging regime where many flat directions are
present at initialization. It is well-established that in this scenario $n=O(d
\log d)$ samples are typically needed. However, we provide precise results
concerning the pre-factors in high-dimensional contexts and for varying widths.
Notably, our findings suggest that overparameterization can only enhance
convergence by a constant factor within this problem class. These insights are
grounded in the reduction of SGD dynamics to a stochastic process in lower
dimensions, where escaping mediocrity equates to calculating an exit time. Yet,
we demonstrate that a deterministic approximation of this process adequately
represents the escape time, implying that the role of stochasticity may be
minimal in this scenario.
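For intuition, here is a minimal, self-contained simulation in the spirit of the setup above: online SGD for a two-layer student learning a single-index target with a "hard" link function, recording how many samples the largest student-teacher overlap needs to escape the O(1/sqrt(d)) scale of random initialization. The specific link, width, learning-rate scaling, and escape threshold are illustrative assumptions, not the paper's exact choices.

```python
# Minimal illustrative simulation (assumed: link g = He3, width p = 8,
# learning rate ~ 1/d, escape threshold 0.3; not the paper's exact setup).
import numpy as np

rng = np.random.default_rng(0)
d, p = 100, 8                                 # input dimension, student width
eta = 0.05 / d                                # per-sample learning rate
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def g(z):                                     # "hard" link: 3rd Hermite polynomial
    return z ** 3 - 3 * z

def g_prime(z):
    return 3 * z ** 2 - 3

W = rng.standard_normal((p, d)) / np.sqrt(d)  # trained first layer
a = np.ones(p) / p                            # fixed second layer

def overlap(W):
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.max(np.abs(rows @ w_star))

threshold, max_steps = 0.3, 500_000
for t in range(1, max_steps + 1):
    x = rng.standard_normal(d)                # one-pass (online) SGD
    y = g(w_star @ x)
    pre = W @ x
    residual = a @ g(pre) - y
    W -= eta * np.outer(residual * a * g_prime(pre), x)   # squared-loss step
    if t % 5_000 == 0 and overlap(W) > threshold:
        print(f"escaped at ~{t} samples, n/(d log d) ≈ {t / (d * np.log(d)):.1f}")
        break
else:
    print(f"no escape within {max_steps} samples")
```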
Related papers
- Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks [28.87871359825978]
We reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks.
We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures.
We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse onto simple subnetworks with either vanishing or redundant neurons.
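As a toy companion to the "redundant neuron" invariant sets mentioned in this summary, the script below checks a simple consistent fact: if two hidden units of a two-layer ReLU network start as exact duplicates, SGD gives them identical gradients, so the duplicated configuration is preserved along the trajectory. Architecture, data, and step size are arbitrary choices for illustration.

```python
# Toy check: duplicated hidden units receive identical SGD gradients in a
# two-layer ReLU network, so the "redundant neuron" set is invariant.
import numpy as np

rng = np.random.default_rng(1)
d, p, n, eta = 5, 6, 100, 0.01
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((p, d)) / np.sqrt(d)  # first layer
a = rng.standard_normal(p) / np.sqrt(p)       # second layer
W[1], a[1] = W[0].copy(), a[0]                # make units 0 and 1 duplicates

for _ in range(1000):
    i = rng.integers(n)
    h = np.maximum(W @ X[i], 0.0)             # ReLU activations
    r = a @ h - y[i]                          # residual on this sample
    grad_a = r * h
    grad_W = r * np.outer(a * (h > 0), X[i])
    a -= eta * grad_a
    W -= eta * grad_W
print("units 0 and 1 still identical:", np.array_equal(W[0], W[1]) and a[0] == a[1])
```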
arXiv Detail & Related papers (2023-06-07T08:44:51Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
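For illustration only, here is a sketch of the importance-sampling SGD component (without the paper's streaming sketch data structure), using row-norm sampling and an unbiased 1/(n p_i) reweighting on a plain least-squares problem; the sampling scheme, loss, and step size are assumptions.

```python
# Illustration: single-sample SGD where index i is drawn with probability
# proportional to its squared row norm and the gradient is reweighted by
# 1/(n * p_i), keeping the update unbiased. Least squares stands in for the
# robust loss; no streaming data structure is used here.
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

row_sq = np.linalg.norm(A, axis=1) ** 2
p = row_sq / row_sq.sum()                     # importance-sampling distribution

x = np.zeros(d)
eta0 = 0.5 / row_sq.mean()
for t in range(1, 20_001):
    i = rng.choice(n, p=p)
    grad = A[i] * (A[i] @ x - b[i]) / (n * p[i])   # unbiased gradient estimate
    x -= (eta0 / np.sqrt(t)) * grad
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```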
arXiv Detail & Related papers (2022-07-16T03:09:30Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study the properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
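A rough numerical sketch of the phenomenon described above: random-features regression fit by multi-pass SGD, sweeping the number of features N across the interpolation threshold N ≈ n, where the test error typically peaks before descending again. The data model, ReLU features, and training schedule below are assumptions rather than the paper's setting.

```python
# Rough sketch: test error of random-features regression trained with
# multi-pass SGD, swept over the number of features N.
import numpy as np

rng = np.random.default_rng(3)
d, n, n_test = 30, 200, 2000
w = rng.standard_normal(d) / np.sqrt(d)

def sample(m):
    X = rng.standard_normal((m, d))
    return X, X @ w + 0.5 * rng.standard_normal(m)

X_tr, y_tr = sample(n)
X_te, y_te = sample(n_test)

for N in [20, 50, 100, 150, 200, 250, 400, 800]:
    F = rng.standard_normal((d, N)) / np.sqrt(d)          # random first layer
    Z_tr, Z_te = np.maximum(X_tr @ F, 0), np.maximum(X_te @ F, 0)
    theta = np.zeros(N)
    eta = 0.3 / (Z_tr ** 2).sum(axis=1).mean()            # conservative step size
    for _ in range(300):                                  # multi-pass SGD
        for i in rng.permutation(n):
            theta -= eta * Z_tr[i] * (Z_tr[i] @ theta - y_tr[i])
    print(f"N = {N:4d}   test MSE = {np.mean((Z_te @ theta - y_te) ** 2):.3f}")
```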
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Stochastic Gradient Descent Meets Distribution Regression [0.0]
Stochastic gradient descent (SGD) provides a simple and efficient way to solve a broad range of machine learning problems.
We focus on distribution regression (DR), which involves two stages of sampling and regresses from probability measures to real-valued responses.
arXiv Detail & Related papers (2020-10-24T09:03:00Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z) - Choosing the Sample with Lowest Loss makes SGD Robust [19.08973384659313]
We propose a simple variant of the vanilla stochastic gradient descent (SGD) method: in each step, the update uses the candidate sample with the smallest current loss.
The resulting algorithm effectively minimizes a modified objective built from the lowest-loss samples.
Our theoretical analysis of this idea for ML problems is backed up with small-scale neural network experiments.
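A minimal sketch of the selection rule described in this summary, assuming a linear model with squared loss and an arbitrary choice of k and step size: draw k candidate samples, keep the one with the smallest current loss, and apply a vanilla SGD update with it (k = 1 recovers plain SGD); the injected outliers illustrate why this choice adds robustness.

```python
# Minimal "min-loss-of-k" SGD sketch on a corrupted linear regression problem.
import numpy as np

rng = np.random.default_rng(4)
n, d, k, eta = 1000, 10, 4, 0.01
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
y[rng.choice(n, 50, replace=False)] += 20.0        # corrupt 5% of the labels

w = np.zeros(d)
for _ in range(20_000):
    idx = rng.choice(n, size=k, replace=False)     # k candidate samples
    losses = 0.5 * (X[idx] @ w - y[idx]) ** 2
    j = idx[np.argmin(losses)]                     # keep the lowest-loss one
    w -= eta * X[j] * (X[j] @ w - y[j])            # vanilla SGD step on it
print("parameter error:", np.linalg.norm(w - w_true))
```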
arXiv Detail & Related papers (2020-01-10T05:39:17Z) - Robust Learning Rate Selection for Stochastic Optimization via Splitting
Diagnostic [5.395127324484869]
SplitSGD is a new dynamic learning rate schedule for stochastic optimization.
The method decreases the learning rate for better adaptation to the local geometry of the objective function.
It incurs essentially no additional computational cost compared to standard SGD.
arXiv Detail & Related papers (2019-10-18T19:38:53Z)
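To make the idea concrete, here is a heavily simplified caricature of a splitting-style diagnostic (not the actual SplitSGD procedure; all constants and the least-squares objective are assumptions): periodically run two independent SGD probe threads from the current iterate and halve the learning rate when their progress directions have negative inner product, suggesting the iterates are mostly oscillating around a stationary point.

```python
# Caricature of an adaptive step-size schedule driven by a splitting diagnostic.
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)

def sgd(w, eta, steps):
    for _ in range(steps):
        i = rng.integers(n)
        w = w - eta * X[i] * (X[i] @ w - y[i])
    return w

w, eta = np.zeros(d), 0.02
for stage in range(15):
    w = sgd(w, eta, steps=500)                 # ordinary SGD phase
    d1 = sgd(w.copy(), eta, 200) - w           # probe thread 1
    d2 = sgd(w.copy(), eta, 200) - w           # probe thread 2
    if d1 @ d2 < 0:                            # disagreement -> decay step size
        eta *= 0.5
    print(f"stage {stage:2d}   eta = {eta:.5f}   "
          f"error = {np.linalg.norm(w - w_true):.3f}")
```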