Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime
- URL: http://arxiv.org/abs/2403.08160v1
- Date: Wed, 13 Mar 2024 00:59:25 GMT
- Title: Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime
- Authors: Hong Hu, Yue M. Lu, Theodor Misiakiewicz
- Abstract summary: Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power.
- Score: 22.666759017118796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in machine learning have been achieved by using
overparametrized models trained until near interpolation of the training data.
It was shown, e.g., through the double descent phenomenon, that the number of
parameters is a poor proxy for the model complexity and generalization
capabilities. This leaves open the question of understanding the impact of
parametrization on the performance of these models. How do model complexity
and generalization depend on the number of parameters $p$? How should we choose
$p$ relative to the sample size $n$ to achieve optimal test error?
In this paper, we investigate the example of random feature ridge regression
(RFRR). This model can be seen either as a finite-rank approximation to kernel
ridge regression (KRR), or as a simplified model for neural networks trained in
the so-called lazy regime. We consider covariates uniformly distributed on the
$d$-dimensional sphere and compute sharp asymptotics for the RFRR test error in
the high-dimensional polynomial scaling, where $p,n,d \to \infty$ while $p/
d^{\kappa_1}$ and $n / d^{\kappa_2}$ stay constant, for all $\kappa_1 ,
\kappa_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the
impact of the number of random features and regularization parameter on the
test performance. In particular, RFRR exhibits an intuitive trade-off between
approximation and generalization power. For $n = o(p)$, the sample size $n$ is
the bottleneck and RFRR achieves the same performance as KRR (which is
equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the
number of random features $p$ is the limiting factor and RFRR test error
matches the approximation error of the random feature model class (akin to
taking $n = \infty$). Finally, a double descent appears at $n= p$, a phenomenon
that was previously only characterized in the linear scaling $\kappa_1 =
\kappa_2 = 1$.
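To make the setup concrete, below is a minimal NumPy sketch of RFRR with covariates uniform on the sphere, sweeping the number of random features $p$ through the interpolation threshold $p = n$. The activation, target function, sample sizes, and ridge parameter are illustrative assumptions, not the paper's; the sharp asymptotics above require the joint limit $p, n, d \to \infty$ described in the abstract.

```python
# A minimal numerical sketch of random feature ridge regression (RFRR) with
# covariates uniform on the sphere. The activation (tanh), target function,
# sample sizes, and ridge value are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def sphere(n, d):
    """Sample n points uniformly on the sphere of radius sqrt(d) in R^d."""
    x = rng.standard_normal((n, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

def rfrr_fit_predict(X_train, y_train, X_test, p, lam, act=np.tanh):
    """Fit RFRR with p random features and return test predictions.

    Features are act(X W^T / sqrt(d)), with the rows of W drawn uniformly on
    the sphere; this is one common parameterization of the random feature model.
    """
    n, d = X_train.shape
    W = sphere(p, d)                            # random first-layer weights
    Z_train = act(X_train @ W.T / np.sqrt(d))   # n x p feature matrix
    Z_test = act(X_test @ W.T / np.sqrt(d))
    # Ridge solution: a = (Z^T Z / n + lam * I)^{-1} Z^T y / n
    a = np.linalg.solve(Z_train.T @ Z_train / n + lam * np.eye(p),
                        Z_train.T @ y_train / n)
    return Z_test @ a

def f_star(X):
    """Assumed target: a linear term plus a small degree-2 component."""
    return X[:, 0] + 0.5 * X[:, 0] * X[:, 1] / np.sqrt(X.shape[1])

# Sweep p through the interpolation threshold p = n, where the abstract
# predicts a double descent peak in the test error.
d, n, n_test, lam = 30, 300, 2000, 1e-4
X_train, X_test = sphere(n, d), sphere(n_test, d)
y_train = f_star(X_train) + 0.1 * rng.standard_normal(n)

for p in [n // 4, n // 2, n, 2 * n, 8 * n]:
    y_hat = rfrr_fit_predict(X_train, y_train, X_test, p, lam)
    print(f"p = {p:5d} (p/n = {p/n:4.2f}): test error = "
          f"{np.mean((y_hat - f_star(X_test)) ** 2):.4f}")
```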
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where $\boldsymbol{x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels.
We study empirical risk universality over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Generalization error of random features and kernel methods:
hypercontractivity and kernel matrix concentration [19.78800773518545]
We study the use of random features methods in conjunction with ridge regression in the feature space $\mathbb{R}^N$.
This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.
arXiv Detail & Related papers (2021-01-26T06:46:41Z) - Estimating Stochastic Linear Combination of Non-linear Regressions
Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large, the estimation errors will be sacrificed by too much.
To the best of our knowledge, this is the first work that provides such guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Implicit Regularization of Random Feature Models [10.739602293023058]
We investigate the connection between Random Feature (RF) models and Kernel Ridge Regression (KRR)
We show that the average RF predictor is close to a KRR predictor with an effective ridge $\tilde{\lambda}$.
We empirically find an extremely good agreement between the test errors of the average $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor.
arXiv Detail & Related papers (2020-02-19T19:36:23Z) - Does generalization performance of $l^q$ regularization learning depend
on $q$? A negative example [19.945160684285003]
$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling.
We show that all $l^q$ estimators for $0 < q < \infty$ attain similar generalization error bounds.
This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact in terms of the generalization capability.
arXiv Detail & Related papers (2013-07-25T00:48:04Z)