Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks
- URL: http://arxiv.org/abs/2109.09304v3
- Date: Fri, 14 Apr 2023 04:36:27 GMT
- Title: Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks
- Authors: Zhichao Wang and Yizhe Zhu
- Abstract summary: We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
- Score: 29.03095282348978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate a two-layer fully connected neural network of
the form $f(X)=\frac{1}{\sqrt{d_1}}\boldsymbol{a}^\top \sigma\left(WX\right)$,
where $X\in\mathbb{R}^{d_0\times n}$ is a deterministic data matrix,
$W\in\mathbb{R}^{d_1\times d_0}$ and $\boldsymbol{a}\in\mathbb{R}^{d_1}$ are
random Gaussian weights, and $\sigma$ is a nonlinear activation function. We
study the limiting spectral distributions of two empirical kernel matrices
associated with $f(X)$: the empirical conjugate kernel (CK) and neural tangent
kernel (NTK), beyond the linear-width regime ($d_1\asymp n$). We focus on the
$\textit{ultra-wide regime}$, where the width $d_1$ of the first layer is much
larger than the sample size $n$. Under appropriate assumptions on $X$ and
$\sigma$, a deformed semicircle law emerges as $d_1/n\to\infty$ and
$n\to\infty$. We first prove this limiting law for generalized sample
covariance matrices with some dependency. To specify it for our neural network
model, we provide a nonlinear Hanson-Wright inequality that is suitable for
neural networks with random weights and Lipschitz activation functions. We also
demonstrate non-asymptotic concentrations of the empirical CK and NTK around
their limiting kernels in the spectral norm, along with lower bounds on their
smallest eigenvalues. As an application, we show that random feature regression
induced by the empirical kernel achieves the same asymptotic performance as its
limiting kernel regression under the ultra-wide regime. This allows us to
calculate the asymptotic training and test errors for random feature regression
using the corresponding kernel regression.
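For readers who want to experiment with the objects above, here is a minimal sketch (not the authors' code) that forms the empirical CK and NTK for Gaussian weights and checks numerically how the empirical CK concentrates around a proxy of its limiting kernel as the width $d_1$ grows past $n$. The ReLU activation, all sizes, and the Monte-Carlo proxy for the limiting kernel are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): empirical conjugate kernel (CK) and
# NTK for f(X) = a^T sigma(WX) / sqrt(d1), plus a numerical check that the CK
# concentrates around a Monte-Carlo proxy of its limiting kernel as d1 >> n.
# ReLU, the sizes, and the averaging-based proxy are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0.0).astype(z.dtype)

def empirical_ck(X, W, act=relu):
    """Empirical CK: (1/d1) * sigma(WX)^T sigma(WX), an n x n Gram matrix."""
    d1 = W.shape[0]
    S = act(W @ X)                      # d1 x n feature matrix sigma(WX)
    return S.T @ S / d1

def empirical_ntk(X, W, a, act=relu, act_prime=relu_prime):
    """Empirical NTK in the convention where both layers are trained:
    CK + (X^T X) .* (1/d1) sigma'(WX)^T diag(a^2) sigma'(WX)."""
    d1 = W.shape[0]
    D = act_prime(W @ X)                # d1 x n matrix of derivatives
    first_layer = (X.T @ X) * ((D * (a ** 2)[:, None]).T @ D / d1)
    return empirical_ck(X, W, act) + first_layer

# Data matrix X in R^{d0 x n}; n stays small relative to the widths below.
d0, n = 64, 200
X = rng.standard_normal((d0, n)) / np.sqrt(d0)

# Monte-Carlo proxy for the limiting CK: average independent empirical CKs at a
# large width (a stand-in for the exact expectation kernel, not its formula).
proxy_ck = np.mean(
    [empirical_ck(X, rng.standard_normal((8192, d0))) for _ in range(20)], axis=0
)

# Spectral-norm concentration of the empirical CK as d1 / n grows.
for d1 in (256, 1024, 4096, 16384):
    W = rng.standard_normal((d1, d0))
    a = rng.standard_normal(d1)
    ck = empirical_ck(X, W)
    ntk = empirical_ntk(X, W, a)
    gap = np.linalg.norm(ck - proxy_ck, ord=2)
    lam_min = np.linalg.eigvalsh(ntk)[0]
    print(f"d1 = {d1:6d}   ||CK_hat - CK_proxy||_2 ~ {gap:.4f}   "
          f"lambda_min(NTK) ~ {lam_min:.4f}")
```

The printed spectral-norm gaps should shrink as $d_1$ increases, which is the qualitative content of the concentration statements above; the smallest NTK eigenvalue is shown only to indicate where the lower-bound results would apply.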
Related papers
- Universality of kernel random matrices and kernel regression in the quadratic regime [18.51014786894174]
In this work, we extend the study of kernel regression to the quadratic regime.
We establish an operator norm approximation bound for the difference between the original kernel random matrix and a quadratic kernel random matrix.
We characterize the precise training and generalization errors for kernel ridge regression (KRR) in the quadratic regime when $n/d^2$ converges to a nonzero constant.
arXiv Detail & Related papers (2024-08-02T07:29:49Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Optimal Rate of Kernel Regression in Large Dimensions [13.641780902673792]
We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large dimensional data.
We utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n \asymp d^{\gamma}$ for $\gamma = 2, 4, 6, 8, \cdots$.
arXiv Detail & Related papers (2023-09-08T11:29:05Z) - LU decomposition and Toeplitz decomposition of a neural network [5.276232626689567]
We show that any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ can be approximated to arbitrary accuracy by a neural network.
A consequence of our Toeplitz result is a fixed-width universal approximation for convolutional neural networks.
arXiv Detail & Related papers (2022-11-25T07:26:39Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural
Networks [0.0]
We show that the largest eigenvalue has the same limit (in probability) as that of some well-known linear random matrix ensembles.
This may be of interest for applications to machine learning.
arXiv Detail & Related papers (2022-01-13T00:48:20Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and uses a factor of $b$ fewer bits for storage than classical random features; a rough, hedged sketch of this idea appears after the list below.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably be learned with gradient descent to small loss, beyond what neural tangent kernel (NTK) analyses capture.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
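As referenced in the Ternary Random Feature entry above, the following is a rough, hedged sketch of the general idea (not the construction from that paper): i.i.d. ternary weights in $\{-1, 0, +1\}$ yield a random-feature kernel matrix close to the one built from Gaussian weights, while forming $W\mathbf{x}$ needs only additions and subtractions. The sign activation, the sparsity level $s$, and all sizes are assumptions for illustration.

```python
# Hedged sketch of the idea behind ternary random features (not the paper's
# exact TRF construction): ternary weights give a Gram/kernel matrix close to
# the Gaussian-weight one, while W @ x needs no multiplications.
import numpy as np

rng = np.random.default_rng(1)
d0, d1, n = 64, 8192, 100
X = rng.standard_normal((d0, n)) / np.sqrt(d0)

def feature_gram(W, X, act=np.sign):
    """(1/d1) * act(WX)^T act(WX): the kernel matrix built from random features."""
    S = act(W @ X)
    return S.T @ S / W.shape[0]

# Gaussian weights vs. ternary weights with P(+1) = P(-1) = s/2 and P(0) = 1 - s,
# rescaled so both weight distributions have unit variance.
s = 0.5
W_gauss = rng.standard_normal((d1, d0))
W_tern = rng.choice([-1.0, 0.0, 1.0], size=(d1, d0),
                    p=[s / 2, 1 - s, s / 2]) / np.sqrt(s)

K_gauss = feature_gram(W_gauss, X)
K_tern = feature_gram(W_tern, X)
gap = np.linalg.norm(K_gauss - K_tern, ord=2) / np.linalg.norm(K_gauss, ord=2)
print(f"relative spectral-norm gap between the two kernel matrices: {gap:.3f}")
```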