A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks
- URL: http://arxiv.org/abs/2310.07891v3
- Date: Sun, 16 Jun 2024 20:44:54 GMT
- Title: A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks
- Authors: Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban,
- Abstract summary: Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks.
We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components.
- Score: 43.281323350357404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.
Related papers
- Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks [13.983863226803336]
We argue that "Feature Averaging" is one of the principal factors contributing to non-robustness of deep neural networks.
We provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task.
We prove that, with the provision of more granular supervised information, a two-layer multi-class neural network is capable of learning individual features.
arXiv Detail & Related papers (2024-10-14T09:28:32Z) - How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks [18.809547338077905]
The ability of learning useful features is one of the major advantages of neural networks.
Recent works show that neural network can operate in a neural tangent kernel (NTK) regime that does not allow feature learning.
arXiv Detail & Related papers (2024-06-03T20:15:28Z) - Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms.
We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency.
Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z) - Graph Neural Networks Provably Benefit from Structural Information: A
Feature Learning Perspective [53.999128831324576]
Graph neural networks (GNNs) have pioneered advancements in graph representation learning.
This study investigates the role of graph convolution within the context of feature learning theory.
arXiv Detail & Related papers (2023-06-24T10:21:11Z) - Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural
Networks [49.808194368781095]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks.
This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z) - The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.