Spectrum of inner-product kernel matrices in the polynomial regime and
multiple descent phenomenon in kernel ridge regression
- URL: http://arxiv.org/abs/2204.10425v1
- Date: Thu, 21 Apr 2022 22:20:52 GMT
- Title: Spectrum of inner-product kernel matrices in the polynomial regime and
multiple descent phenomenon in kernel ridge regression
- Authors: Theodor Misiakiewicz
- Abstract summary: The kernel matrix is well approximated by its degree-$\ell$ polynomial approximation.
We show that the spectrum of the matrix converges in distribution to a Marchenko-Pastur law.
- Score: 3.997680012976965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the spectrum of inner-product kernel matrices, i.e., $n \times n$
matrices with entries $h (\langle \textbf{x}_i ,\textbf{x}_j \rangle/d)$ where
the $( \textbf{x}_i)_{i \leq n}$ are i.i.d.~random covariates in
$\mathbb{R}^d$. In the linear high-dimensional regime $n \asymp d$, it was
shown that these matrices are well approximated by their linearization, which
simplifies into the sum of a rescaled Wishart matrix and identity matrix. In
this paper, we generalize this decomposition to the polynomial high-dimensional
regime $n \asymp d^\ell,\ell \in \mathbb{N}$, for data uniformly distributed on
the sphere and hypercube. In this regime, the kernel matrix is well
approximated by its degree-$\ell$ polynomial approximation and can be
decomposed into a low-rank spike matrix, identity and a `Gegenbauer matrix'
with entries $Q_\ell (\langle \textbf{x}_i , \textbf{x}_j \rangle)$, where
$Q_\ell$ is the degree-$\ell$ Gegenbauer polynomial. We show that the spectrum
of the Gegenbauer matrix converges in distribution to a Marchenko-Pastur law.
This problem is motivated by the study of the prediction error of kernel
ridge regression (KRR) in the polynomial regime $n \asymp d^\kappa, \kappa >0$.
Previous work showed that for $\kappa \not\in \mathbb{N}$, KRR fits exactly a
degree-$\lfloor \kappa \rfloor$ polynomial approximation to the target
function. In this paper, we use our characterization of the kernel matrix to
complete this picture and compute the precise asymptotics of the test error in
the limit $n/d^\kappa \to \psi$ with $\kappa \in \mathbb{N}$. In this case, the
test error can present a double descent behavior, depending on the effective
regularization and signal-to-noise ratio at level $\kappa$. Because this double
descent can occur each time $\kappa$ crosses an integer, this explains the
multiple descent phenomenon in the KRR risk curve observed in several previous
works.
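The linear-regime decomposition that this paper generalizes can be checked numerically. The sketch below is a minimal illustration assuming $h = \tanh$ and covariates uniform on the sphere of radius $\sqrt{d}$ (these choices are not taken from the paper); it compares the kernel matrix against its linearization, a rescaled Wishart matrix plus identity (the rank-one spike vanishes here because $h(0) = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 400                      # linear regime n ≍ d (ℓ = 1), kept small for speed

# x_i uniform on the sphere of radius sqrt(d), so <x_i, x_i>/d = 1
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)

h = np.tanh                          # an arbitrary smooth kernel function h
K = h(X @ X.T / d)                   # K_ij = h(<x_i, x_j>/d)

# Linearization: K ≈ h(0)·11ᵀ + h'(0)·XXᵀ/d + (h(1) - h(0) - h'(0))·I,
# i.e. a rescaled Wishart matrix plus identity (plus a rank-one spike,
# zero here since h(0) = 0).  For tanh, h'(0) = 1.
K_lin = (h(0.0) * np.ones((n, n))
         + 1.0 * (X @ X.T) / d
         + (h(1.0) - h(0.0) - 1.0) * np.eye(n))
err = np.linalg.norm(K - K_lin, ord=2)
print(f"operator-norm error of the linearization: {err:.4f}")
```

In the polynomial regime $n \asymp d^\ell$ with $\ell \geq 2$, the analogous check would compare against the degree-$\ell$ approximation, whose extra degree-$\ell$ Gegenbauer term carries the Marchenko-Pastur bulk.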
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z)
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime [12.221087476416056]
In this paper, we show that this phenomenon is universal, holding as soon as $X$ has i.i.d. entries with all finite moments.
In the case of non-integer $\ell$, the Marchenko-Pastur term disappears.
arXiv Detail & Related papers (2023-10-27T17:15:55Z)
- Data-Driven Linear Complexity Low-Rank Approximation of General Kernel Matrices: A Geometric Approach [0.9453554184019107]
A kernel matrix may be defined as $K_{ij} = \kappa(x_i, y_j)$ where $\kappa(x, y)$ is a kernel function.
We seek a low-rank approximation to a kernel matrix where the sets of points $X$ and $Y$ are large.
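The paper's landmark selection is geometric; as a generic stand-in, a CUR/Nyström-style sketch with uniformly random landmarks (an illustrative simplification, not the paper's method, with a hypothetical RBF kernel) looks like:

```python
import numpy as np

def rbf(X, Y, gamma=0.1):
    # kernel function κ(x, y) = exp(-γ ||x - y||²)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))   # large point set X
Y = rng.standard_normal((400, 5))   # large point set Y

# CUR/Nyström: pick r landmark rows/columns, then K ≈ C W⁺ R with
# C = κ(X, Y_land), W = κ(X_land, Y_land), R = κ(X_land, Y)
r = 50
ix = rng.choice(len(X), r, replace=False)
iy = rng.choice(len(Y), r, replace=False)
C = rbf(X, Y[iy])
W = rbf(X[ix], Y[iy])
R = rbf(X[ix], Y)
K_approx = C @ np.linalg.pinv(W) @ R   # never forms more than O((m+n)r) kernel entries

K = rbf(X, Y)                          # full matrix, only for the error check
rel = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"relative Frobenius error with {r} landmarks: {rel:.3f}")
```

The point of the linear-complexity methods is that `K` itself is never formed; it is built here only to measure the approximation error.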
arXiv Detail & Related papers (2022-12-24T07:15:00Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Perturbation Analysis of Randomized SVD and its Applications to High-dimensional Statistics [8.90202564665576]
We study the statistical properties of RSVD under a general "signal-plus-noise" framework.
We derive nearly-optimal performance guarantees for RSVD when applied to three statistical inference problems.
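A minimal sketch of the RSVD procedure being analyzed, using the standard range-finder with oversampling and power iterations under a toy signal-plus-noise model (all parameter choices here are illustrative):

```python
import numpy as np

def rsvd(A, rank, n_oversample=10, n_iter=2, rng=None):
    # randomized SVD: sketch the range of A, then do a small exact SVD
    rng = rng or np.random.default_rng(0)
    G = rng.standard_normal((A.shape[1], rank + n_oversample))
    Y = A @ G
    for _ in range(n_iter):              # power iterations sharpen the range estimate
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the sketched range
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(2)
# "signal-plus-noise": a rank-3 signal buried in Gaussian noise
m, n, r = 300, 200, 3
signal = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) * 3
A = signal + 0.5 * rng.standard_normal((m, n))
U, s, Vt = rsvd(A, rank=r)
print("top singular values:", np.round(s, 2))
```

With a large spectral gap, as in this spiked model, the recovered top singular values agree closely with the exact SVD.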
arXiv Detail & Related papers (2022-03-19T07:26:45Z)
- On Fast Johnson-Lindenstrauss Embeddings of Compact Submanifolds of $\mathbb{R}^N$ with Boundary [0.4125187280299246]
We consider the probability that a random matrix $A \in \mathbb{R}^{m \times N}$ will serve as a bi-Lipschitz function $A: \mathcal{M} \rightarrow \mathbb{R}^m$ with bi-Lipschitz constants close to one.
We present a new class of highly structured distributions for embedding sufficiently low-dimensional submanifolds of $\mathbb{R}^N$.
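The bi-Lipschitz criterion is easy to check empirically. The sketch below uses a plain dense Gaussian matrix and a toy trigonometric curve as the submanifold (the paper's contribution is fast structured embeddings; this only illustrates the distortion being controlled):

```python
import numpy as np

def pdists(Z):
    # pairwise Euclidean distances via the Gram-matrix identity
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (Z @ Z.T)
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(5)
N, m, n_pts = 1000, 40, 200

# sample points from a low-dimensional submanifold of R^N:
# a toy closed curve t ↦ (cos t, sin t, cos 2t, sin 2t, 0, ..., 0)
t = rng.uniform(0, 2 * np.pi, n_pts)
M = np.zeros((n_pts, N))
M[:, 0], M[:, 1] = np.cos(t), np.sin(t)
M[:, 2], M[:, 3] = np.cos(2 * t), np.sin(2 * t)

A = rng.standard_normal((m, N)) / np.sqrt(m)   # dense Gaussian JL map

# empirical bi-Lipschitz constants over all sampled pairs
orig = pdists(M)
emb = pdists(M @ A.T)
iu = np.triu_indices(n_pts, 1)
ratios = emb[iu] / orig[iu]
print(f"distance ratios in [{ratios.min():.2f}, {ratios.max():.2f}]")
```

Ratios clustered near one over all pairs indicate that $A$ acts as a near-isometry on the sampled manifold.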
arXiv Detail & Related papers (2021-10-08T15:27:52Z)
- Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\text{tr}(A R(z))$, for $R(z) = (\frac{1}{n} X X^T - z I_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show that $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$
arXiv Detail & Related papers (2021-09-06T14:21:43Z)
- Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering [43.50783459690612]
The method consists in randomly "puncturing" both the data matrix $X \in \mathbb{C}^{p \times n}$ and its corresponding kernel (Gram) matrix $K$ through Bernoulli masks.
We empirically confirm on GAN-generated image databases, that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains.
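A minimal sketch of the two-way puncturing idea on a synthetic two-cluster model (the data model and keep-probabilities are illustrative, not the paper's GAN-image setting):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 100, 300
# rank-one spiked data: x_i = ±μ + noise (two clusters)
mu = rng.standard_normal(p)
labels = rng.choice([-1.0, 1.0], n)
X = np.outer(mu, labels) + rng.standard_normal((p, n))

eps_s, eps_b = 0.5, 0.5            # keep-probabilities for data and Gram entries
S = rng.random((p, n)) < eps_s     # Bernoulli mask on the data matrix
B = rng.random((n, n)) < eps_b
B = np.triu(B, 1); B = B + B.T     # symmetric Bernoulli mask, zero diagonal

Xs = X * S                         # punctured data
K = (Xs.T @ Xs) * B / p            # punctured Gram (kernel) matrix

# the dominant eigenvector should still recover the two clusters
w, V = np.linalg.eigh(K)           # eigenvalues ascending, so take the last column
v = V[:, -1]
acc = max(np.mean(np.sign(v) == labels), np.mean(np.sign(-v) == labels))
print(f"clustering accuracy from punctured kernel: {acc:.2f}")
```

Even with half of the data and Gram entries discarded, the spike survives and spectral clustering remains accurate, which is the resource-efficiency point of the method.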
arXiv Detail & Related papers (2021-02-24T14:01:58Z)
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log \langle \varphi(x), \varphi(y) \rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r \ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
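The linear-time kernel application behind this approach is easy to sketch: with $K = \Phi_x \Phi_y^\top$, each Sinkhorn update applies $K$ in $O(nr)$ time instead of $O(n^2)$. The feature maps below are random nonnegative vectors purely for illustration (the paper constructs specific positive features):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 1000, 20

# positive feature maps φ(x) ∈ R^r_+, so K_ij = <φ(x_i), φ(y_j)> > 0
Phi_x = np.abs(rng.standard_normal((n, r)))
Phi_y = np.abs(rng.standard_normal((n, r)))

a = np.full(n, 1.0 / n)            # source marginal
b = np.full(n, 1.0 / n)            # target marginal

# Sinkhorn iterations u ← a / (K v), v ← b / (Kᵀ u), where each
# K-apply factors through the features and costs only O(nr)
u = np.ones(n)
v = np.ones(n)
for _ in range(200):
    u = a / (Phi_x @ (Phi_y.T @ v))
    v = b / (Phi_y @ (Phi_x.T @ u))

# marginal check: row sums of P = diag(u) K diag(v) should match a
row = u * (Phi_x @ (Phi_y.T @ v))
print("max row-marginal error:", np.abs(row - a).max())
```

The column marginal is exact by construction after the final $v$ update; the row marginal converges geometrically, so 200 iterations drive it to machine precision on this dense positive kernel.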
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.