Failure and success of the spectral bias prediction for Kernel Ridge
Regression: the case of low-dimensional data
- URL: http://arxiv.org/abs/2202.03348v1
- Date: Mon, 7 Feb 2022 16:48:14 GMT
- Title: Failure and success of the spectral bias prediction for Kernel Ridge
Regression: the case of low-dimensional data
- Authors: Umberto M. Tomasini, Antonio Sclocchi, Matthieu Wyart
- Abstract summary: In some regimes, they predict that the method has a `spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, it fits well the coefficients associated with the largest eigenvalues.
This prediction works very well on benchmark data sets such as images, yet the assumptions these approaches make on the data are never satisfied in practice.
- Score: 0.28647133890966986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, several theories including the replica method made predictions for
the generalization error of Kernel Ridge Regression. In some regimes, they
predict that the method has a `spectral bias': decomposing the true function
$f^*$ on the eigenbasis of the kernel, it fits well the coefficients associated
with the $O(P)$ largest eigenvalues, where $P$ is the size of the training set.
This prediction works very well on benchmark data sets such as images, yet the
assumptions these approaches make on the data are never satisfied in practice.
To clarify when the spectral bias prediction holds, we first focus on a
one-dimensional model where rigorous results are obtained and then use scaling
arguments to generalize and test our findings in higher dimensions. Our
predictions include the classification case $f(x)=\mathrm{sign}(x_1)$ with a data
distribution that vanishes at the decision boundary $p(x)\sim x_1^{\chi}$. For
$\chi>0$ and a Laplace kernel, we find that (i) there exists a cross-over ridge
$\lambda^*_{d,\chi}(P)\sim P^{-\frac{1}{d+\chi}}$ such that for $\lambda\gg
\lambda^*_{d,\chi}(P)$, the replica method applies, but not for
$\lambda\ll\lambda^*_{d,\chi}(P)$, (ii) in the ridge-less case, spectral bias
predicts the correct training curve exponent only in the limit
$d\rightarrow\infty$.
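The setting in the abstract is concrete enough to simulate. Below is a minimal sketch (not the authors' code) of kernel ridge regression with a Laplace kernel on the one-dimensional task $f^*(x)=\mathrm{sign}(x)$ with a data density vanishing at the decision boundary as $|x|^{\chi}$, evaluated at ridges above and below the predicted cross-over scale $\lambda^*_{d,\chi}(P)\sim P^{-1/(d+\chi)}$; the sample size, $\chi$, kernel scale, and $\lambda$ grid are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Minimal sketch (not the authors' code): kernel ridge regression with a Laplace
# kernel on the 1-d classification task from the abstract, f*(x) = sign(x), with
# data density p(x) ~ |x|^chi vanishing at the decision boundary x = 0.
# P, chi, the kernel scale, and the lambda grid are illustrative assumptions.

rng = np.random.default_rng(0)

def sample_x(n, chi):
    """Sample x in [-1, 1] with density proportional to |x|**chi (chi > 0)."""
    u = rng.uniform(size=n)
    mag = u ** (1.0 / (chi + 1.0))          # inverse CDF of |x|: F(t) = t^(chi+1)
    return rng.choice([-1.0, 1.0], size=n) * mag

def laplace_kernel(a, b, scale=1.0):
    """K(x, y) = exp(-|x - y| / scale)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / scale)

def krr_test_error(P, chi, lam, n_test=2000):
    """Train KRR with ridge lam on P samples; return mean-squared test error."""
    x_tr, x_te = sample_x(P, chi), sample_x(n_test, chi)
    y_tr, y_te = np.sign(x_tr), np.sign(x_te)
    K = laplace_kernel(x_tr, x_tr)
    alpha = np.linalg.solve(K + lam * np.eye(P), y_tr)   # alpha = (K + lam*I)^{-1} y
    y_pred = laplace_kernel(x_te, x_tr) @ alpha
    return float(np.mean((y_pred - y_te) ** 2))

if __name__ == "__main__":
    P, chi, d = 512, 1.0, 1
    lam_star = P ** (-1.0 / (d + chi))      # predicted cross-over scale (up to constants)
    for lam in (100 * lam_star, lam_star, 0.01 * lam_star):
        print(f"lambda = {lam:.3e}   test MSE = {krr_test_error(P, chi, lam):.4f}")
```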
Related papers
- Optimal Rates of Kernel Ridge Regression under Source Condition in Large
Dimensions [15.988264513040903]
We study the large-dimensional behavior of kernel ridge regression (KRR) where the sample size $n \asymp d^{\gamma}$ for some $\gamma > 0$.
Our results illustrate that the rate curves, as $\gamma$ varies, exhibit periodic plateau behavior and multiple descent behavior.
arXiv Detail & Related papers (2024-01-02T16:14:35Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - $L^1$ Estimation: On the Optimality of Linear Estimators [64.76492306585168]
This work shows that the only prior distribution on $X$ that induces linearity in the conditional median is Gaussian.
In particular, it is demonstrated that if the conditional distribution $P_{X|Y=y}$ is symmetric for all $y$, then $X$ must follow a Gaussian distribution.
arXiv Detail & Related papers (2023-09-17T01:45:13Z) - Dimension free ridge regression [10.434481202633458]
We revisit ridge regression on i.i.d. data, characterizing its bias and variance in terms of the bias and variance of an `equivalent' sequence model.
As a new application, we obtain a completely explicit and sharp characterization of ridge regression for Hilbert covariates with regularly varying spectrum.
arXiv Detail & Related papers (2022-10-16T16:01:05Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Estimating Stochastic Linear Combination of Non-linear Regressions
Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large, the estimation errors are not sacrificed by too much.
To the best of our knowledge, this is the first work that provides such guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z) - Out-of-sample error estimate for robust M-estimators with convex penalty [5.33024001730262]
A generic out-of-sample error estimate is proposed for robust $M$-estimators regularized with a convex penalty.
General differentiable loss functions $\psi$ are allowed provided that $\psi=\rho'$ is 1-Lipschitz.
arXiv Detail & Related papers (2020-08-26T21:50:41Z) - Truncated Linear Regression in High Dimensions [26.41623833920794]
In truncated linear regression, we observe pairs $(A_i, y_i)$ whose dependent variable equals $y_i = A_i^{\rm T} \cdot x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest.
The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution.
We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples.
arXiv Detail & Related papers (2020-07-29T00:31:34Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n\times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y)=-\log\langle\varphi(x),\varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r\ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
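The entry above rests on a simple low-rank identity: with unit entropic regularization and ground cost $c(x,y)=-\log\langle\varphi(x),\varphi(y)\rangle$, the Gibbs kernel $e^{-c(x,y)}$ equals $\langle\varphi(x),\varphi(y)\rangle$, so the $n\times n$ kernel matrix factorizes as $\Phi_x\Phi_y^{\rm T}$ with $r$ columns and each Sinkhorn matrix-vector product costs $O(nr)$ instead of $O(n^2)$. A minimal sketch of that product follows, using a hypothetical positive feature map rather than the one proposed in the paper.

```python
import numpy as np

# Minimal sketch of the low-rank idea in the entry above: with unit entropic
# regularization and ground cost c(x, y) = -log<phi(x), phi(y)>, the Gibbs kernel
# exp(-c) factorizes as Phi_x @ Phi_y.T with r columns, so the matrix-vector
# products inside Sinkhorn iterations cost O(n*r) instead of O(n^2).
# The positive feature map below is a hypothetical stand-in, not the paper's.

rng = np.random.default_rng(0)
n, r = 5000, 16
x, y = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))

W = rng.normal(size=(2, r))
def phi(z):
    """A strictly positive feature map z -> R^r_+ (illustrative only)."""
    return np.exp(z @ W - 0.5 * np.sum(z ** 2, axis=1, keepdims=True))

Phi_x, Phi_y = phi(x), phi(y)           # each n x r, entries > 0
v = rng.uniform(size=n)

# Apply the n x n kernel to v without ever forming it: O(n*r) time and memory.
Kv = Phi_x @ (Phi_y.T @ v)

# Sanity check against the explicit kernel on a small subset of points.
K_sub = Phi_x[:64] @ Phi_y[:64].T
assert np.allclose(K_sub @ v[:64], Phi_x[:64] @ (Phi_y[:64].T @ v[:64]))
print("low-rank kernel application OK:", Kv.shape)
```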