More Than a Toy: Random Matrix Models Predict How Real-World Neural
Representations Generalize
- URL: http://arxiv.org/abs/2203.06176v1
- Date: Fri, 11 Mar 2022 18:59:01 GMT
- Title: More Than a Toy: Random Matrix Models Predict How Real-World Neural
Representations Generalize
- Authors: Alexander Wei and Wei Hu and Jacob Steinhardt
- Abstract summary: We find that most theoretical analyses fall short of capturing qualitative phenomena even for kernel regression.
We prove that the classical GCV estimator converges to the generalization risk whenever a local random matrix law holds.
Our findings suggest that random matrix theory may be central to understanding the properties of neural representations in practice.
- Score: 94.70343385404203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Of theories for why large-scale machine learning models generalize despite
being vastly overparameterized, which of their assumptions are needed to
capture the qualitative phenomena of generalization in the real world? On one
hand, we find that most theoretical analyses fall short of capturing these
qualitative phenomena even for kernel regression, when applied to kernels
derived from large-scale neural networks (e.g., ResNet-50) and real data (e.g.,
CIFAR-100). On the other hand, we find that the classical GCV estimator (Craven
and Wahba, 1978) accurately predicts generalization risk even in such
overparameterized settings. To bolster this empirical finding, we prove that
the GCV estimator converges to the generalization risk whenever a local random
matrix law holds. Finally, we apply this random matrix theory lens to explain
why pretrained representations generalize better as well as what factors govern
scaling laws for kernel regression. Our findings suggest that random matrix
theory, rather than just being a toy model, may be central to understanding the
properties of neural representations in practice.
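The following is a minimal sketch of the kind of comparison the abstract describes: computing the classical GCV estimate for kernel ridge regression and checking it against held-out risk. It assumes a synthetic Gaussian-data regression task with an RBF kernel rather than the paper's ResNet-50/CIFAR-100 embeddings; the function names (rbf_kernel, gcv_estimate, test_risk), hyperparameters, and data are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: GCV estimate vs. held-out risk for kernel ridge regression.
# Kernel, data, and hyperparameters below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def gcv_estimate(K, y, lam):
    """Generalized cross-validation estimate of risk (Craven and Wahba):

    GCV(lam) = (1/n) ||(I - S) y||^2 / ((1/n) tr(I - S))^2,
    where S = K (K + n*lam*I)^{-1} is the smoother ("hat") matrix.
    """
    n = len(y)
    S = K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
    resid = y - S @ y
    return (resid @ resid / n) / ((np.trace(np.eye(n) - S) / n) ** 2)

def test_risk(K_train, K_test_train, y_train, y_test, lam):
    """Held-out mean squared error of the kernel ridge regression fit."""
    n = len(y_train)
    alpha = np.linalg.solve(K_train + n * lam * np.eye(n), y_train)
    preds = K_test_train @ alpha
    return np.mean((y_test - preds) ** 2)

# Synthetic regression problem (assumed for illustration only).
d, n_train, n_test = 20, 300, 1000
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w = rng.normal(size=d) / np.sqrt(d)
y_train = X_train @ w + 0.1 * rng.normal(size=n_train)
y_test = X_test @ w + 0.1 * rng.normal(size=n_test)

gamma = 1.0 / d
K_train = rbf_kernel(X_train, X_train, gamma)
K_test_train = rbf_kernel(X_test, X_train, gamma)

for lam in [1e-4, 1e-3, 1e-2, 1e-1]:
    gcv = gcv_estimate(K_train, y_train, lam)
    risk = test_risk(K_train, K_test_train, y_train, y_test, lam)
    print(f"lam={lam:.0e}  GCV={gcv:.4f}  held-out MSE={risk:.4f}")
```

In this toy setting the GCV values track the held-out error across regularization levels; the paper's contribution is showing that this tracking persists for kernels derived from large-scale neural representations, where many other theoretical predictors fail.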
Related papers
- Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression [0.0]
We analyze a class of recently proposed approximation algorithms from the field of probabilistic numerics.
We combine results from the numerical analysis literature with state-of-the-art concentration results for spectra of kernel matrices to obtain minimax contraction rates.
arXiv Detail & Related papers (2024-06-18T14:50:42Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- Generalization in Kernel Regression Under Realistic Assumptions [41.345620270267446]
We provide rigorous bounds for common kernels, holding for any amount of regularization, noise, input dimension, and number of samples.
Our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression.
As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.
arXiv Detail & Related papers (2023-12-26T10:55:20Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions (a worked form of this decomposition is sketched after this list).
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Out-of-Distribution Generalization in Kernel Regression [21.958028127426196]
We study generalization in kernel regression when the training and test distributions are different.
We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel.
We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift.
arXiv Detail & Related papers (2021-06-04T04:54:25Z)
- Benign overfitting in ridge regression [0.0]
We provide non-asymptotic generalization bounds for overparametrized ridge regression.
We identify when small or negative regularization is sufficient for obtaining small generalization error.
arXiv Detail & Related papers (2020-09-29T20:00:31Z)
- Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks [17.188280334580195]
Generalization beyond a training dataset is a main goal of machine learning.
Recent observations in deep neural networks contradict conventional wisdom from classical statistics.
We show that more data may impair generalization when noisy or not expressible by the kernel.
arXiv Detail & Related papers (2020-06-23T17:53:11Z)
- Robust Compressed Sensing using Generative Models [98.64228459705859]
In this paper we propose an algorithm inspired by the Median-of-Means (MOM).
Our algorithm guarantees recovery for heavy-tailed data, even in the presence of outliers.
arXiv Detail & Related papers (2020-06-16T19:07:41Z)
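As a hedged illustration (notation ours, not the cited paper's) of the subfunction decomposition described above in "Predicting Unreliable Predictions by Shattering a Neural Network": if a piecewise linear network $f$ agrees with an affine subfunction $f_r$ on each activation region $R_r$, and $n_r$ of the $n$ training points fall in $R_r$, then the empirical error decomposes as

$$\hat{L}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \;=\; \sum_{r} \frac{n_r}{n}\,\underbrace{\frac{1}{n_r}\sum_{x_i \in R_r} \ell\big(f_r(x_i), y_i\big)}_{\hat{L}_r(f_r)} \;=\; \mathbb{E}_{r \sim \hat{p}}\big[\hat{L}_r(f_r)\big],$$

where $\hat{p}(r) = n_r / n$ is the empirical distribution over activation regions, i.e., the full network's empirical error is an expectation of per-subfunction empirical errors.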