How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks
- URL: http://arxiv.org/abs/2406.01766v2
- Date: Mon, 04 Nov 2024 23:02:25 GMT
- Title: How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks
- Authors: Mo Zhou, Rong Ge
- Abstract summary: The ability to learn useful features is one of the major advantages of neural networks.
Recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning.
- Score: 18.809547338077905
- License:
- Abstract: The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent, through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture the ground-truth directions. We further strengthen this local convergence analysis by incorporating an early-stage feature learning analysis. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training.
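To make the setting concrete, below is a minimal sketch of the kind of training the abstract describes: full-batch gradient descent on a two-layer ReLU network with a weight-decay regularizer, while tracking how well the best-aligned first-layer neuron recovers a ground-truth direction. The single-index teacher, the plain squared loss, and all dimensions and step sizes are illustrative assumptions, not the paper's exact objective or analysis.

```python
# Minimal sketch (assumed setup, not the paper's exact construction):
# weight-decay-regularized gradient descent on a two-layer ReLU network
# fit to a single-index ReLU teacher y = relu(<w*, x>).
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 50, 2000             # input dimension, hidden width, sample size
lr, lam, steps = 0.05, 1e-3, 3000  # step size, weight-decay strength, iterations

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # ground-truth direction (unit norm)
X = rng.normal(size=(n, d))
y = np.maximum(X @ w_star, 0.0)             # teacher labels

W = rng.normal(size=(m, d)) / np.sqrt(d)    # first-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=m) / m     # fixed second-layer weights

for t in range(steps):
    Z = X @ W.T                              # pre-activations, shape (n, m)
    H = np.maximum(Z, 0.0)                   # ReLU activations
    err = H @ a - y                          # residuals, shape (n,)
    # gradient of 0.5*mean(err^2) + 0.5*lam*||W||_F^2 with respect to W
    grad_W = a[:, None] * ((err[:, None] * (Z > 0)).T @ X) / n + lam * W
    W -= lr * grad_W
    if t % 500 == 0 or t == steps - 1:
        dirs = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
        align = np.max(np.abs(dirs @ w_star))   # best alignment with w_star
        print(f"step {t:4d}  loss {0.5 * np.mean(err**2):.4f}  alignment {align:.3f}")
```

With these (arbitrary) settings the printed alignment typically grows as the regularized loss drops, which is the qualitative behavior the paper's local convergence analysis formalizes under its own assumptions.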
Related papers
- Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - Provable Guarantees for Neural Networks via Gradient Feature Learning [15.413985018920018]
This work proposes a unified analysis framework for two-layer networks trained by gradient descent.
The framework is centered around the principle of feature learning from prototypical gradients, and its effectiveness is demonstrated by applications in several problems.
arXiv Detail & Related papers (2023-10-19T01:45:37Z) - A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks [43.281323350357404]
Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks.
We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components; a minimal numerical sketch of this effect appears after this list.
arXiv Detail & Related papers (2023-10-11T20:55:02Z) - Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective [53.999128831324576]
Graph neural networks (GNNs) have pioneered advancements in graph representation learning.
This study investigates the role of graph convolution within the context of feature learning theory.
arXiv Detail & Related papers (2023-06-24T10:21:11Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - The Connection Between Approximation, Depth Separation and Learnability in Neural Networks [70.55686685872008]
We study the connection between learnability and approximation capacity.
We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target.
arXiv Detail & Related papers (2021-01-31T11:32:30Z) - Training Convolutional Neural Networks With Hebbian Principal Component Analysis [10.026753669198108]
Hebbian learning can be used for training the lower or the higher layers of a neural network.
We use a nonlinear Hebbian Principal Component Analysis (HPCA) learning rule, in place of the Hebbian Winner Takes All (HWTA) strategy.
In particular, the HPCA rule is used to train Convolutional Neural Networks in order to extract relevant features from the CIFAR-10 image dataset.
arXiv Detail & Related papers (2020-12-22T18:17:46Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
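As a companion to the one-gradient-step entry above, here is a minimal numerical sketch of how a single large full-batch step on the first-layer weights tends to produce a few dominant singular values, i.e. low-rank "spike" components, in the weight update. The multi-index tanh teacher, the sqrt(n) step size, and all dimensions are illustrative assumptions, not the cited paper's construction.

```python
# Minimal sketch (assumed setup): one full-batch gradient step with a large
# learning rate on the first layer of a two-layer tanh network, followed by
# an inspection of the singular values of the resulting weight update.
import numpy as np

rng = np.random.default_rng(1)
d, m, n, k = 100, 200, 5000, 3                # input dim, width, samples, teacher rank
U = np.linalg.qr(rng.normal(size=(d, k)))[0]  # k orthonormal ground-truth directions

X = rng.normal(size=(n, d))
y = np.tanh(X @ U).sum(axis=1)                # multi-index teacher labels

W0 = rng.normal(size=(m, d)) / np.sqrt(d)         # initial first-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second-layer weights
lr = np.sqrt(n)                                   # "large" step size growing with n

H = np.tanh(X @ W0.T)                         # hidden activations, shape (n, m)
err = H @ a - y                               # residuals, shape (n,)
grad_W = a[:, None] * ((err[:, None] * (1.0 - H**2)).T @ X) / n  # MSE gradient w.r.t. W
W1 = W0 - lr * grad_W                         # one full-batch gradient step

svals = np.linalg.svd(W1 - W0, compute_uv=False)
print("top singular values of the update:", np.round(svals[:6], 2))
print("median (bulk) singular value:     ", np.round(np.median(svals), 2))
```

In this illustrative run a handful of leading singular values usually separate from the bulk, matching the qualitative claim that one large step already injects several rank-one (feature-like) components into the first-layer weights.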