Related papers: A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

URL: http://arxiv.org/abs/2310.07891v3
Date: Sun, 16 Jun 2024 20:44:54 GMT
Title: A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks
Authors: Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban,
Abstract summary: Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components.
Score: 43.281323350357404
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.

Related papers

Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks [13.983863226803336]
We argue that "Feature Averaging" is one of the principal factors contributing to non-robustness of deep neural networks. We provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task. We prove that, with the provision of more granular supervised information, a two-layer multi-class neural network is capable of learning individual features.
arXiv Detail & Related papers (2024-10-14T09:28:32Z)
Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks. We show that the networks acquire strong, data-dependent features. Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z)
How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks [18.809547338077905]
The ability of learning useful features is one of the major advantages of neural networks. Recent works show that neural network can operate in a neural tangent kernel (NTK) regime that does not allow feature learning.
arXiv Detail & Related papers (2024-06-03T20:15:28Z)
Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms. We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency. Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z)
Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer. In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z)
Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination [33.273226655730326]
We show that each layer of a deep linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate.<n>This is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks.
arXiv Detail & Related papers (2023-11-06T09:00:38Z)
Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective [53.999128831324576]
Graph neural networks (GNNs) have pioneered advancements in graph representation learning. This study investigates the role of graph convolution within the context of feature learning theory.
arXiv Detail & Related papers (2023-06-24T10:21:11Z)
Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks [49.808194368781095]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks. This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z)
Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs) Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space. This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error. We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations. This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z)
Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising simple neural networks can achieve zero training loss. We show that kernels typically associated with the ReLU activation function have fundamental flaws. We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z)
The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks [43.860358308049044]
In work, we show that these common perceptions can be completely false in the early phase of learning. We argue that this surprising simplicity can persist in networks with more layers with convolutional architecture.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.