Generalization error of random features and kernel methods:
hypercontractivity and kernel matrix concentration
- URL: http://arxiv.org/abs/2101.10588v1
- Date: Tue, 26 Jan 2021 06:46:41 GMT
- Title: Generalization error of random features and kernel methods:
hypercontractivity and kernel matrix concentration
- Authors: Song Mei, Theodor Misiakiewicz, Andrea Montanari
- Abstract summary: We study the use of random features methods in conjunction with ridge regression in the feature space ${\mathbb R}^N$.
This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.
- Score: 19.78800773518545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider the classical supervised learning problem: we are given data
$(y_i,{\boldsymbol x}_i)$, $i\le n$, with $y_i$ a response and ${\boldsymbol
x}_i\in {\mathcal X}$ a covariates vector, and try to learn a model
$f:{\mathcal X}\to{\mathbb R}$ to predict future responses. Random features
methods map the covariates vector ${\boldsymbol x}_i$ to a point ${\boldsymbol
\phi}({\boldsymbol x}_i)$ in a higher dimensional space ${\mathbb R}^N$, via a
random featurization map ${\boldsymbol \phi}$. We study the use of random
features methods in conjunction with ridge regression in the feature space
${\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of
kernel ridge regression (KRR), or as a stylized model for neural networks in
the so-called lazy training regime.
We define a class of problems satisfying certain spectral conditions on the
underlying kernels, and a hypercontractivity assumption on the associated
eigenfunctions. These conditions are verified by classical high-dimensional
examples. Under these conditions, we prove a sharp characterization of the
error of random features ridge regression. In particular, we address two
fundamental questions: $(1)$~What is the generalization error of KRR? $(2)$~How
big $N$ should be for the random features approximation to achieve the same
error as KRR?
In this setting, we prove that KRR is well approximated by a projection onto
the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample
size $n$. We show that the test error of random features ridge regression is
dominated by its approximation error and is larger than the error of KRR as
long as $N\le n^{1-\delta}$ for some $\delta>0$. We characterize this gap. For
$N\ge n^{1+\delta}$, random features achieve the same error as the
corresponding KRR, and further increasing $N$ does not lead to a significant
change in test error.
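To make the setup above concrete, here is a minimal, self-contained sketch (an illustration, not code accompanying the paper): it fits random features ridge regression in ${\mathbb R}^N$ with ReLU features alongside KRR with the corresponding $N\to\infty$ kernel (the degree-1 arc-cosine kernel), on synthetic data on the sphere. The featurization map, target function, dimensions, and ridge level are all assumptions chosen for illustration; the random features test error should approach the KRR error as $N$ grows, in line with the $N\le n^{1-\delta}$ versus $N\ge n^{1+\delta}$ dichotomy described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: covariates uniform on the sphere of radius sqrt(d) in R^d,
# noisy nonlinear response (all choices here are illustrative).
d, n, n_test = 20, 500, 500

def sphere(m):
    z = rng.standard_normal((m, d))
    return np.sqrt(d) * z / np.linalg.norm(z, axis=1, keepdims=True)

X, X_test = sphere(n), sphere(n_test)
f_star = lambda x: x[:, 0] + np.maximum(x[:, 1] * x[:, 2], 0.0)
y = f_star(X) + 0.1 * rng.standard_normal(n)

lam = 1e-3  # ridge penalty, scaled by n below

# Random features ridge regression: phi(x) = relu(W x) / sqrt(N), ridge in R^N.
def rfrr_predict(N):
    W = rng.standard_normal((N, d)) / np.sqrt(d)        # rows w_k ~ N(0, I_d / d)
    Phi = np.maximum(X @ W.T, 0.0) / np.sqrt(N)         # n x N feature matrix
    Phi_test = np.maximum(X_test @ W.T, 0.0) / np.sqrt(N)
    a = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(N), Phi.T @ y)
    return Phi_test @ a

# KRR with the N -> infinity limit of this random features kernel, i.e.
# E_w[relu(<w,x>) relu(<w,x'>)]: the degree-1 arc-cosine kernel.
def relu_kernel(A, B):
    na = np.linalg.norm(A, axis=1) / np.sqrt(d)
    nb = np.linalg.norm(B, axis=1) / np.sqrt(d)
    cos = np.clip((A @ B.T) / d / np.outer(na, nb), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(na, nb) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

K = relu_kernel(X, X)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
krr_pred = relu_kernel(X_test, X) @ alpha

test_err = lambda pred: np.mean((pred - f_star(X_test)) ** 2)
print(f"KRR test error: {test_err(krr_pred):.4f}")
for N in [50, 500, 5000]:  # RFRR error should approach the KRR error as N grows
    print(f"RFRR test error (N={N}): {test_err(rfrr_predict(N)):.4f}")
```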
Related papers
- Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, random features ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that set it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Optimal Rates of Kernel Ridge Regression under Source Condition in Large
Dimensions [15.988264513040903]
We study the large-dimensional behavior of kernel ridge regression (KRR) where the sample size $n \asymp d^{\gamma}$ for some $\gamma > 0$.
Our results illustrate that the rate curves, as $\gamma$ varies, exhibit periodic plateau behavior and multiple descent behavior.
arXiv Detail & Related papers (2024-01-02T16:14:35Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Mean Estimation in High-Dimensional Binary Markov Gaussian Mixture
Models [12.746888269949407]
We consider a high-dimensional mean estimation problem over a binary hidden Markov model.
We establish a nearly minimax optimal (up to logarithmic factors) estimation error rate, as a function of $\|\theta_*\|, \delta, d, n$.
arXiv Detail & Related papers (2022-06-06T09:34:04Z) - Robust Testing in High-Dimensional Sparse Models [0.0]
We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models.
We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n = \Omega\left(\min(s\log d,\, 1/\gamma^4)\right)$ samples.
arXiv Detail & Related papers (2022-05-16T07:47:22Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where ${\boldsymbol x}_i \in {\mathbb R}^p$ are feature vectors and $y \in {\mathbb R}$ are labels.
We study empirical risk universality over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of ${\bf K}$ is independent of the distribution of the i.i.d. entries of ${\bf w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show that $\|{\mathbb E}[R(z)] - \tilde{R}(z)\|_F$
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
We study efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm.
We show that in the regime $n\rho \gg \sqrt{N}$, any spectral method from a large class (and more generally, any low-degree polynomial of the input) fails to detect the planted vector.
arXiv Detail & Related papers (2021-05-31T16:10:49Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing a $n \times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y) = -\log\langle \varphi(x), \varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant ${\mathbb R}^r_+$, with $r \ll n$ (a minimal illustration follows after this list).
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
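As a concrete illustration of the factorized-kernel idea in the last entry above, here is a minimal sketch (an illustration under stated assumptions, not code from that paper): when the Gibbs kernel factorizes as $K = \Phi_x \Phi_y^\top$ for a positive feature map with $r \ll n$, each Sinkhorn iteration applies $K$ in $O(nr)$ time rather than $O(n^2)$. The positive random features below target the Gaussian kernel $\exp(-\|x-y\|^2/2)$, i.e. a squared-Euclidean cost with regularization $\varepsilon = 1$; the feature construction, dimensions, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_features(X, W):
    """phi_w(x) = exp(<w,x> - ||x||^2) / sqrt(r) for w ~ N(0, I_d), so that
    E_w[<phi(x), phi(y)>] = exp(-||x - y||^2 / 2) (Gaussian Gibbs kernel, eps = 1)."""
    r = W.shape[0]
    return np.exp(X @ W.T - np.sum(X ** 2, axis=1, keepdims=True)) / np.sqrt(r)

def sinkhorn_factored(Phi_x, Phi_y, a, b, n_iter=300):
    """Sinkhorn scalings for the implicit plan P = diag(u) (Phi_x Phi_y^T) diag(v).
    The kernel is only ever applied in factored form, e.g. Phi_x @ (Phi_y.T @ v),
    which costs O(n r) per iteration instead of O(n^2)."""
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        u = a / (Phi_x @ (Phi_y.T @ v))
        v = b / (Phi_y @ (Phi_x.T @ u))
    return u, v

# Usage: two point clouds in R^2 with uniform weights.
n, m, d, r = 2000, 2000, 2, 100
X = rng.random((n, d))
Y = rng.random((m, d)) + 0.5
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

W = rng.standard_normal((r, d))                         # shared feature directions
Phi_x, Phi_y = positive_features(X, W), positive_features(Y, W)
u, v = sinkhorn_factored(Phi_x, Phi_y, a, b)

# After convergence the row marginal of P should match a (the column marginal
# matches b exactly because v was updated last).
row_marginal = u * (Phi_x @ (Phi_y.T @ v))
print("max row-marginal violation:", np.abs(row_marginal - a).max())
```

In the setting of the entry above, the cost is defined directly as $c(x,y) = -\log\langle \varphi(x), \varphi(y)\rangle$, so the Gibbs kernel is exactly $\Phi_x \Phi_y^\top$; the random-feature construction here is just one way to obtain such a positive factorization approximately.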
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.