Eigenvalue distribution of the Neural Tangent Kernel in the quadratic scaling
- URL: http://arxiv.org/abs/2508.20036v1
- Date: Wed, 27 Aug 2025 16:41:01 GMT
- Title: Eigenvalue distribution of the Neural Tangent Kernel in the quadratic scaling
- Authors: Lucas Benigni, Elliot Paquette
- Abstract summary: We compute the eigenvalue distribution of the neural tangent kernel of a two-layer neural network under a specific scaling of dimension. We describe the distribution as the free multiplicative convolution of the Marchenko--Pastur distribution with a deterministic distribution depending on $\sigma$ and $D$.
- Score: 5.142160533428576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We compute the asymptotic eigenvalue distribution of the neural tangent kernel of a two-layer neural network under a specific scaling of dimension. Namely, if $X\in\mathbb{R}^{n\times d}$ is an i.i.d. random matrix, $W\in\mathbb{R}^{d\times p}$ is an i.i.d. $\mathcal{N}(0,1)$ matrix and $D\in\mathbb{R}^{p\times p}$ is a diagonal matrix with i.i.d. bounded entries, we consider the matrix \[ \mathrm{NTK} = \frac{1}{d}XX^\top \odot \frac{1}{p} \sigma'\left( \frac{1}{\sqrt{d}}XW \right)D^2 \sigma'\left( \frac{1}{\sqrt{d}}XW \right)^\top \] where $\sigma'$ is a pseudo-Lipschitz function applied entrywise, under the scaling $\frac{n}{dp}\to \gamma_1$ and $\frac{p}{d}\to \gamma_2$. We describe the asymptotic distribution as the free multiplicative convolution of the Marchenko--Pastur distribution with a deterministic distribution depending on $\sigma$ and $D$.
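A minimal numerical sketch of this construction: sample $X$, $W$, $D$ as in the abstract, form the NTK matrix, and inspect its empirical spectrum. The dimensions, the law of the diagonal of $D$, and the choice $\sigma' = \tanh$ are illustrative assumptions, not the paper's experiments.

```python
# Hypothetical sketch: sample the NTK matrix from the abstract and look at its
# empirical eigenvalue distribution under the quadratic scaling n ~ gamma_1 * d * p.
import numpy as np

rng = np.random.default_rng(0)

d = 50
gamma_2 = 2.0                       # p / d
gamma_1 = 0.3                       # n / (d p), the "quadratic" scaling
p = int(gamma_2 * d)                # 100
n = int(gamma_1 * d * p)            # 1500

X = rng.standard_normal((n, d))            # i.i.d. data matrix
W = rng.standard_normal((d, p))            # i.i.d. N(0,1) weight matrix
dvals = rng.uniform(0.5, 1.5, size=p)      # i.i.d. bounded diagonal entries of D
S = np.tanh(X @ W / np.sqrt(d))            # sigma'(XW / sqrt(d)), applied entrywise

# NTK = (1/d) X X^T  (Hadamard product)  (1/p) S D^2 S^T
NTK = (X @ X.T / d) * (S @ np.diag(dvals**2) @ S.T / p)

eigvals = np.linalg.eigvalsh(NTK)
print("smallest / median / largest eigenvalue:",
      eigvals.min(), np.median(eigvals), eigvals.max())
```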
Related papers
- MLPs at the EOC: Spectrum of the NTK [7.826806223782053]
We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K}$. The quantity $\Delta_\phi = \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases.
arXiv Detail & Related papers (2025-01-22T21:12:51Z) - In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients. We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$ for each client.
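As a hedged illustration of this setup, the sketch below runs one natural distributed scheme, alternating least squares with a server-side update of the shared factor; the paper's actual algorithm and guarantees may differ.

```python
# Hedged sketch: federated low-rank factorisation X_i ~ U_i V^T with a global V
# shared by all clients and a local factor U_i per client. Alternating least
# squares is one natural scheme, not necessarily the algorithm analysed in the paper.
import numpy as np

rng = np.random.default_rng(1)
d, r = 50, 5
n_i = [200, 150, 300, 250]                              # per-client sample sizes (4 clients)
X = [rng.standard_normal((n, d)) for n in n_i]          # illustrative client data

V = np.linalg.qr(rng.standard_normal((d, r)))[0]        # shared global factor, d x r
for _ in range(20):
    # local step: each client solves U_i = argmin ||X_i - U V^T||_F^2
    U = [Xi @ V @ np.linalg.inv(V.T @ V) for Xi in X]
    # aggregation step: the server updates the shared V from client statistics
    A = sum(Xi.T @ Ui for Xi, Ui in zip(X, U))          # sum_i X_i^T U_i
    B = sum(Ui.T @ Ui for Ui in U)                      # sum_i U_i^T U_i
    V = A @ np.linalg.inv(B)

err = sum(np.linalg.norm(Xi - Ui @ V.T) ** 2 for Xi, Ui in zip(X, U))
print("total reconstruction error:", err)
```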
arXiv Detail & Related papers (2024-09-13T12:28:42Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that set it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^{d} \to \mathbb{R}$ using an exponential activation function.
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
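A minimal sketch of this learning problem with $\sigma = \tanh$, Gaussian inputs, a small fraction of flipped labels, and plain gradient descent on the square loss; all of these choices are illustrative assumptions rather than the paper's setting.

```python
# Illustrative fit of a single neuron x -> sigma(w . x) by gradient descent.
import numpy as np

rng = np.random.default_rng(2)
d, n = 20, 5000
sigma = np.tanh                                  # a monotone activation

w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = sigma(X @ w_star)
flip = rng.random(n) < 0.05                      # a small fraction of corrupted labels
y[flip] = -y[flip]

w = np.zeros(d)
lr = 0.5
for _ in range(500):
    pred = sigma(X @ w)
    # chain rule: d/dw (1/2n) sum (sigma(x.w) - y)^2, with tanh'(u) = 1 - tanh(u)^2
    grad = X.T @ ((pred - y) * (1.0 - pred ** 2)) / n
    w -= lr * grad

print("excess square loss vs. the clean target:",
      np.mean((sigma(X @ w) - sigma(X @ w_star)) ** 2))
```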
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks [8.74634652691576]
We consider a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large? We prove sharp bounds in terms of Kullback-Leibler divergence and Rényi information dimension.
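A toy version of the construction, assuming a standard Gaussian cloud and a single random $m$-dimensional subspace with $m = 1$; the paper studies the whole collection of such projections, which this sketch does not attempt.

```python
# Project an n-point cloud in R^d onto a random m-dimensional subspace and look at
# the empirical distribution of the projected points (illustrative data and m).
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 2000, 200, 1
X = rng.standard_normal((n, d))                   # the data cloud

Q, _ = np.linalg.qr(rng.standard_normal((d, m)))  # orthonormal basis of a random subspace
proj = X @ Q                                      # projected points, n x m

hist, edges = np.histogram(proj[:, 0], bins=30, density=True)
print("empirical density of the first projected coordinate:")
print(np.round(hist, 3))
```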
arXiv Detail & Related papers (2022-06-14T00:07:33Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication, and they require a factor of $b$ fewer bits of storage compared to classical random features.
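The sketch below illustrates the general multiplication-free idea with weights and activation values in $\{-1, 0, +1\}$; the thresholds and probabilities are assumptions for illustration and need not match the paper's exact TRF construction.

```python
# Hedged sketch of a ternary random feature map: ternary weights and a ternary
# nonlinearity, so the features can in principle be computed with additions and
# comparisons only (numpy still uses a dense matmul here, for simplicity).
import numpy as np

rng = np.random.default_rng(3)
d, p, n = 64, 512, 1000
X = rng.standard_normal((n, d))

# ternary random weights: each entry is -1, 0 or +1
W = rng.choice([-1, 0, 1], size=(d, p), p=[0.25, 0.5, 0.25])

def ternary_activation(z, t=1.0):
    # sign with a dead zone: returns -1, 0 or +1
    return np.sign(z) * (np.abs(z) > t)

F = ternary_activation(X @ W / np.sqrt(d))        # random feature matrix, n x p
K = F @ F.T / p                                   # empirical kernel
print("kernel shape:", K.shape, "top eigenvalue:", np.linalg.eigvalsh(K)[-1])
```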
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
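A hedged sketch of random feature ridge regression with the empirical kernel $\frac{1}{p} f(XW/\sqrt{d})\, f(XW/\sqrt{d})^\top$ in a wide regime ($p \gg n$); the activation, target function and ridge parameter are illustrative choices, not the paper's setup.

```python
# Random feature ridge regression with the empirical kernel, in an ultra-wide regime.
import numpy as np

rng = np.random.default_rng(4)
d, p, n_train, n_test = 30, 8000, 300, 100        # p >> n: wide regime
f = np.tanh

X = rng.standard_normal((n_train + n_test, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n_train + n_test)
Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

W = rng.standard_normal((d, p))
Ftr = f(Xtr @ W / np.sqrt(d))
Fte = f(Xte @ W / np.sqrt(d))

K = Ftr @ Ftr.T / p                               # empirical kernel on training data
K_cross = Fte @ Ftr.T / p                         # test/train kernel block
ridge = 1e-3
alpha = np.linalg.solve(K + ridge * np.eye(n_train), ytr)
pred = K_cross @ alpha
print("test MSE:", np.mean((pred - yte) ** 2))
```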
arXiv Detail & Related papers (2021-09-20T05:25:52Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show that $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$
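For concreteness, a small numerical sketch of the objects involved, using i.i.d. columns purely for illustration (the paper's point is precisely the non-identically distributed case): form $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and evaluate $\mathrm{tr}(AR(z))$ for a deterministic $A$.

```python
# Numerical sketch of the resolvent of a sample covariance matrix and the linear
# functional tr(A R(z)); dimensions, the choice of A and the point z are illustrative.
import numpy as np

rng = np.random.default_rng(5)
p, n = 200, 600
X = rng.standard_normal((p, n))                   # columns are the samples (i.i.d. here)
A = np.diag(np.linspace(0.5, 1.5, p))             # a deterministic test matrix

z = -0.5                                          # a point away from the (nonnegative) spectrum
R = np.linalg.inv(X @ X.T / n - z * np.eye(p))    # resolvent R(z)
print("tr(A R(z)) =", np.trace(A @ R))
```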
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Stochastic behavior of outcome of Schur-Weyl duality measurement [45.41082277680607]
We focus on the measurement defined by the decomposition based on Schur-Weyl duality on $n$ qubits.
We derive various types of distributions, including a kind of central limit theorem, when $n$ goes to infinity.
arXiv Detail & Related papers (2021-04-26T15:03:08Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient descent, with loss guarantees relative to the ground truth and sample complexity going beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.