Spectral Evolution and Invariance in Linear-width Neural Networks
- URL: http://arxiv.org/abs/2211.06506v2
- Date: Wed, 8 Nov 2023 04:50:55 GMT
- Title: Spectral Evolution and Invariance in Linear-width Neural Networks
- Authors: Zhichao Wang, Andrew Engel, Anand Sarwate, Ioana Dumitriu, Tony Chiang
- Abstract summary: We investigate the spectral properties of linear-width feed-forward neural networks.
We show that the spectra of weight matrices in this high-dimensional regime are invariant when trained by gradient descent with small constant learning rates.
We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy-tailed behavior.
- Score: 8.419660614226816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the spectral properties of linear-width feed-forward neural
networks, where the sample size is asymptotically proportional to network
width. Empirically, we show that the spectra of weight matrices in this
high-dimensional regime are invariant when trained by gradient descent with small constant
learning rates; we provide a theoretical justification for this observation and
prove the invariance of the bulk spectra for both conjugate and neural tangent
kernels. We demonstrate similar characteristics when training with stochastic
gradient descent with small learning rates. When the learning rate is large, we
exhibit the emergence of an outlier whose corresponding eigenvector is aligned
with the training data structure. We also show that after adaptive gradient
training, where a lower test error and feature learning emerge, both weight and
kernel matrices exhibit heavy-tailed behavior. Simple examples are provided to
explain when heavy tails can lead to better generalization. We exhibit different
spectral properties, such as an invariant bulk, spikes, and heavy-tailed
distributions, in a two-layer neural network under different training
strategies, and then correlate them with feature learning. Analogous
phenomena also appear when we train conventional neural networks with
real-world data. We conclude that monitoring the evolution of the spectra
during training is an essential step toward understanding the training dynamics
and feature learning.
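Below is a minimal illustrative sketch of the kind of spectral monitoring the abstract advocates; it is not the authors' implementation, and the architecture, dimensions, activation, and learning rate are assumptions made for concreteness. It trains a two-layer network with full-batch gradient descent in a linear-width setting (sample size, input dimension, and width all comparable) and periodically records the singular-value spectrum of the hidden weight matrix and the eigenvalue spectrum of the conjugate kernel.

```python
# Sketch only: track weight and conjugate-kernel spectra during gradient descent
# on a toy two-layer network. All names and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 512, 512, 512                    # linear-width regime: n, d, width comparable
X = rng.standard_normal((n, d)) / np.sqrt(d)   # synthetic inputs
y = np.sign(X @ rng.standard_normal(d))        # synthetic +/-1 labels

W = rng.standard_normal((width, d)) / np.sqrt(d)   # hidden-layer weights
a = rng.standard_normal(width) / np.sqrt(width)    # output weights

def forward(W, a, X):
    H = np.tanh(X @ W.T)                       # hidden representation, shape (n, width)
    return H, H @ a                            # predictions, shape (n,)

def spectra(W, H):
    # Singular values of the weights and eigenvalues of the conjugate kernel H H^T / width.
    weight_sv = np.linalg.svd(W, compute_uv=False)
    ck_eig = np.linalg.eigvalsh(H @ H.T / width)
    return weight_sv, ck_eig

lr = 0.01                                      # "small constant learning rate" case
for step in range(200):
    H, pred = forward(W, a, X)
    err = pred - y                             # gradient of 0.5 * mean squared error
    grad_a = H.T @ err / n
    grad_W = ((np.outer(err, a) * (1.0 - H**2)).T @ X) / n
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 50 == 0:
        sv, ck = spectra(W, H)
        print(f"step {step:4d}  top weight s.v. {sv[0]:.3f}  top CK eig {ck[-1]:.3f}")
```

In the small-learning-rate regime the abstract predicts that these spectra stay close to their bulk at initialization (a Marchenko-Pastur-type bulk for i.i.d. Gaussian weights in this proportional regime), whereas large learning rates and adaptive optimizers are where the paper reports outlier spikes aligned with the data structure and heavy-tailed spectra; varying the learning rate or optimizer in a sketch like this is one way to probe those qualitative regimes on toy data.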
Related papers
- Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z)
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- A theory of data variability in Neural Network Bayesian inference [0.70224924046445]
We provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks.
We derive the generalization properties from the statistical properties of the input.
We show that data variability leads to a non-Gaussian action reminiscent of a $(\varphi^3 + \varphi^4)$-theory.
arXiv Detail & Related papers (2023-07-31T14:11:32Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data [44.431266188350655]
We consider the generalization error of two-layer neural networks trained to interpolation by gradient descent.
We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error.
In contrast to previous work on benign overfitting that requires linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
arXiv Detail & Related papers (2022-02-11T23:04:00Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)