Random Feature Amplification: Feature Learning and Generalization in
Neural Networks
- URL: http://arxiv.org/abs/2202.07626v4
- Date: Thu, 14 Sep 2023 02:48:36 GMT
- Title: Random Feature Amplification: Feature Learning and Generalization in
Neural Networks
- Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett
- Abstract summary: We provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent.
We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate.
- Score: 44.431266188350655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we provide a characterization of the feature-learning process
in two-layer ReLU networks trained by gradient descent on the logistic loss
following random initialization. We consider data with binary labels that are
generated by an XOR-like function of the input features. We permit a constant
fraction of the training labels to be corrupted by an adversary. We show that,
although linear classifiers are no better than random guessing for the
distribution we consider, two-layer ReLU networks trained by gradient descent
achieve generalization error close to the label noise rate. We develop a novel
proof technique that shows that at initialization, the vast majority of neurons
function as random features that are only weakly correlated with useful
features, and the gradient descent dynamics 'amplify' these weak, random
features to strong, useful features.
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-squares regression problem with a two-layer fully-connected neural network with ReLU activations, trained by gradient flow.
Our first result is a generalization guarantee that requires no assumptions on the underlying regression function or on the noise, other than that both are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
- Asymptotics of feature learning in two-layer networks after one gradient-step [39.02152620420932]
We investigate how two-layer neural networks learn features from data, and improve over the kernel regime, after a single gradient descent step.
We model the trained network by a spiked Random Features (sRF) model.
We provide an exact description of the generalization error of the sRF in the high-dimensional limit.
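As a rough illustration of what a spiked Random Features map looks like (the exact construction and scalings are specified in the paper), the sketch below adds a rank-one spike to a random ReLU feature map and fits ridge regression on top; the spike direction, its strength, and the downstream fit are assumptions standing in for the effect of one gradient step.

```python
# Rough sketch of a spiked Random Features (sRF) map: random first-layer
# weights plus a rank-one "spike", used as fixed features for ridge regression.
# The spike here is an assumed stand-in for the update from one gradient step.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, spike_strength, ridge = 100, 400, 2000, 2.0, 1e-2

W0 = rng.standard_normal((p, d)) / np.sqrt(d)        # plain random features
u = rng.standard_normal(p); u /= np.linalg.norm(u)   # spike left direction
v = rng.standard_normal(d); v /= np.linalg.norm(v)   # spike right direction
W = W0 + spike_strength * np.outer(u, v)             # spiked weight matrix

def features(X):
    return np.maximum(X @ W.T, 0.0)                  # ReLU random features

# Teacher depending on the spiked direction, so the spike actually helps.
X = rng.standard_normal((n, d))
y = np.tanh(X @ v) + 0.1 * rng.standard_normal(n)

Phi = features(X)
theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(p), Phi.T @ y)

X_test = rng.standard_normal((1000, d))
pred = features(X_test) @ theta
print("test MSE:", np.mean((pred - np.tanh(X_test @ v)) ** 2))
```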
arXiv Detail & Related papers (2024-02-07T15:57:30Z)
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Grokking as the Transition from Lazy to Rich Training Dynamics [35.186196991224286]
Grokking occurs when the training loss of a neural network decreases much earlier than its test loss.
Key determinants of grokking are the rate of feature learning and the alignment of the initial features with the target function.
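One common way to quantify the second determinant (an assumption here, not necessarily the measure used in the paper) is the centered kernel-target alignment between the Gram matrix of the network's initial features and the label matrix:

```python
# One possible measure (an assumption, not necessarily the paper's) of how well
# a network's initial features align with the target: centered kernel-target
# alignment between the init-feature Gram matrix and y y^T.
import numpy as np

def kernel_target_alignment(features, y):
    """features: (n, p) array of hidden activations; y: (n,) +/-1 labels."""
    K = features @ features.T
    Y = np.outer(y, y)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc, Yc = H @ K @ H, H @ Y @ H
    return np.sum(Kc * Yc) / (np.linalg.norm(Kc) * np.linalg.norm(Yc))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = np.sign(X[:, 0] * X[:, 1])                      # a toy target
W = rng.standard_normal((128, 20)) / np.sqrt(20)    # untrained first layer
print("alignment at init:", kernel_target_alignment(np.maximum(X @ W.T, 0), y))
```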
arXiv Detail & Related papers (2023-10-09T19:33:21Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
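A simple way to probe this claim empirically (an illustrative protocol, not necessarily the authors') is to evaluate a partially trained model on a Gaussian "clone" of the test set that matches only per-class means and covariances; the accuracy gap between real and cloned data indicates how much the model relies on higher-order statistics.

```python
# Illustrative probe of the distributional simplicity bias: replace each class
# of the test set with a Gaussian clone matching only its mean and covariance,
# and compare accuracy on real versus cloned data. A gap near zero suggests the
# model is only using low-order statistics; a growing gap suggests reliance on
# higher-order structure.
import numpy as np

def gaussian_clone(X, y, rng):
    """Resample each class from a Gaussian with that class's mean/covariance."""
    X_clone = np.empty_like(X)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False)
        X_clone[idx] = rng.multivariate_normal(mu, cov, size=len(idx))
    return X_clone

def accuracy_gap(predict, X, y, rng):
    real = np.mean(predict(X) == y)
    clone = np.mean(predict(gaussian_clone(X, y, rng)) == y)
    return real - clone

# Toy usage with a threshold "model" on 2-D Gaussian-mixture data:
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2)) + np.repeat([[0, 0], [2, 2]], 200, axis=0)
y = np.repeat([0, 1], 200)
print(accuracy_gap(lambda Z: (Z.sum(axis=1) > 2).astype(int), X, y, rng))
```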
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
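The rank claim can be checked numerically by tracking the stable rank ||W||_F^2 / ||W||_2^2 of the first-layer weights across one gradient step; the toy data, leaky-ReLU slope, and scalings below are illustrative choices rather than the paper's exact setting.

```python
# Toy numerical check (not the paper's exact setting): stable rank of the first
# layer of a two-layer leaky ReLU network before and after a single full-batch
# gradient descent step on the logistic loss, from a small initialization.
import torch

torch.manual_seed(0)
n, d, width, init_scale, lr = 50, 500, 256, 1e-4, 10.0

X = torch.randn(n, d)                        # high-dimensional, nearly-orthogonal inputs
y = torch.sign(torch.randn(n))
W = (init_scale * torch.randn(width, d)).requires_grad_()
a = torch.sign(torch.randn(width)) / width   # fixed second layer

def stable_rank(M):
    s = torch.linalg.svdvals(M)
    return (s.square().sum() / s[0].square()).item()

print("stable rank at init:", round(stable_rank(W.detach()), 1))
logits = torch.nn.functional.leaky_relu(X @ W.T, negative_slope=0.1) @ a
loss = torch.nn.functional.softplus(-y * logits).mean()
loss.backward()
with torch.no_grad():
    W -= lr * W.grad
print("stable rank after one GD step:", round(stable_rank(W.detach()), 1))
```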
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks [66.76034024335833]
We find that diverse/complex features are indeed learned by the backbone, and that their brittleness is due to the linear classification head relying primarily on the simplest features.
We propose a Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits.
We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
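A minimal sketch of a feature-reconstruction-style regularizer is shown below; the linear decoder, the detached reconstruction target, and the loss weighting are illustrative assumptions, and the exact form of FRR is given in the paper.

```python
# Sketch of a feature-reconstruction-style regularizer (the exact FRR is in the
# paper; the linear decoder and weighting below are assumptions): penalize
# features that cannot be reconstructed back from the logits.
import torch
import torch.nn as nn

feat_dim, num_classes, lam = 512, 10, 0.1
backbone = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())   # stand-in backbone
head = nn.Linear(feat_dim, num_classes)                        # linear classification head
decoder = nn.Linear(num_classes, feat_dim)                     # maps logits -> features

opt = torch.optim.SGD(
    list(backbone.parameters()) + list(head.parameters()) + list(decoder.parameters()),
    lr=0.01,
)

def training_step(x, y):
    feats = backbone(x)
    logits = head(feats)
    ce = nn.functional.cross_entropy(logits, y)
    # Reconstruction term; detaching the target is a simplification here.
    recon = nn.functional.mse_loss(decoder(logits), feats.detach())
    loss = ce + lam * recon
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random data:
x = torch.randn(64, 32)
y = torch.randint(0, num_classes, (64,))
print("loss:", training_step(x, y))
```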
arXiv Detail & Related papers (2022-10-04T04:01:15Z)
- Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data [44.431266188350655]
We consider the generalization error of two-layer neural networks trained to interpolation by gradient descent.
We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error.
In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
arXiv Detail & Related papers (2022-02-11T23:04:00Z)
- Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
Experiments on several benchmark datasets and neural architectures show that the binary network learned using our method achieves state-of-the-art accuracy.
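The underlying idea can be sketched with a custom autograd function that keeps the exact sign in the forward pass and uses the derivative of a truncated Fourier sine series of the sign (square-wave) function as the surrogate gradient in the backward pass; the truncation order below is an illustrative choice, and the paper's estimator may differ in detail.

```python
# Sketch of the idea (the paper's exact estimator may differ): exact sign in the
# forward pass, Fourier-series surrogate gradient in the backward pass.
import math
import torch

NUM_TERMS = 5  # number of odd harmonics kept in the truncated series

class FourierSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # sign(x) ~ (4/pi) * sum_{k odd} sin(k x) / k on (-pi, pi), so use the
        # derivative of the truncated series, (4/pi) * sum_{k odd} cos(k x),
        # as the surrogate gradient.
        ks = torch.arange(1, 2 * NUM_TERMS, 2, dtype=x.dtype, device=x.device)
        surrogate = (4.0 / math.pi) * torch.cos(x.unsqueeze(-1) * ks).sum(-1)
        return grad_output * surrogate

# Toy usage: binarize a weight vector yet receive a nonzero gradient.
w = torch.randn(8, requires_grad=True)
FourierSign.apply(w).sum().backward()
print(w.grad)
```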
arXiv Detail & Related papers (2021-03-01T08:25:26Z)
- Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z)