Implicit Regularization of Random Feature Models
- URL: http://arxiv.org/abs/2002.08404v2
- Date: Wed, 23 Sep 2020 10:29:43 GMT
- Title: Implicit Regularization of Random Feature Models
- Authors: Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel
- Abstract summary: We investigate the connection between Random Feature (RF) models and Kernel Ridge Regression (KRR)
We show that the average RF predictor is close to a KRR predictor with an effective ridge $\tilde{\lambda}$.
We empirically find an extremely good agreement between the test errors of the average $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor.
- Score: 10.739602293023058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random Feature (RF) models are used as efficient parametric approximations of
kernel methods. We investigate, by means of random matrix theory, the
connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a
Gaussian RF model with $P$ features, $N$ data points, and a ridge $\lambda$, we
show that the average (i.e. expected) RF predictor is close to a KRR predictor
with an effective ridge $\tilde{\lambda}$. We show that $\tilde{\lambda} >
\lambda$ and $\tilde{\lambda} \searrow \lambda$ monotonically as $P$ grows,
thus revealing the implicit regularization effect of finite RF sampling. We
then compare the risk (i.e. test error) of the $\tilde{\lambda}$-KRR predictor
with the average risk of the $\lambda$-RF predictor and obtain a precise and
explicit bound on their difference. Finally, we empirically find an extremely
good agreement between the test errors of the average $\lambda$-RF predictor
and $\tilde{\lambda}$-KRR predictor.
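To make the claim above concrete, here is a minimal numerical sketch (not the authors' code; the toy data set, the RBF kernel standing in for $K$, the way the Gaussian features are sampled, and all normalizations are illustrative assumptions). It averages many $\lambda$-RF predictors over feature draws and then grid-searches for the KRR ridge whose predictor best matches that average; the best-matching ridge comes out above $\lambda$, in line with the effective ridge $\tilde{\lambda} > \lambda$ described in the abstract.

```python
# Minimal sketch (not the authors' code): illustrate that the *average* lambda-RF
# predictor behaves like a KRR predictor with a larger effective ridge. The toy data,
# the RBF kernel, the normalizations, and all names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 30, 40, 0.1            # training points, random features, explicit ridge

# Toy 1-D regression data and test inputs.
X = rng.uniform(-1.0, 1.0, size=(N, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(N)
Xt = np.linspace(-1.0, 1.0, 200)[:, None]

def rbf(A, B, gamma=5.0):
    """Gaussian (RBF) kernel matrix, standing in for the kernel K of the abstract."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K, Kt = rbf(X, X), rbf(Xt, X)

def krr(ridge):
    """Kernel ridge regression predictor evaluated on the test inputs."""
    return Kt @ np.linalg.solve(K + ridge * np.eye(N), y)

# Gaussian RF model: P i.i.d. features drawn as Gaussian processes with covariance K,
# scaled by 1/sqrt(P) so that the feature Gram matrix approximates K as P grows.
Z = np.vstack([X, Xt])
L = np.linalg.cholesky(rbf(Z, Z) + 1e-8 * np.eye(len(Z)))

def rf_predictor():
    W = (L @ rng.standard_normal((len(Z), P))) / np.sqrt(P)      # features at train+test
    F, Ft = W[:N], W[N:]
    theta = np.linalg.solve(F.T @ F + lam * np.eye(P), F.T @ y)  # ridge fit in feature space
    return Ft @ theta

avg_rf = np.mean([rf_predictor() for _ in range(2000)], axis=0)  # Monte Carlo average

# Grid-search the KRR ridge that best matches the averaged RF predictor; it should
# exceed lam, mirroring the implicit regularization of finite feature sampling.
grid = np.linspace(lam, 5 * lam, 200)
best = grid[np.argmin([np.mean((krr(r) - avg_rf) ** 2) for r in grid])]
print(f"explicit ridge lambda = {lam}, best-matching KRR ridge ~ {best:.3f}")
```

Up to Monte Carlo noise, increasing $P$ in this sketch moves the best-matching ridge back toward $\lambda$, mirroring the statement that $\tilde{\lambda} \searrow \lambda$ as $P$ grows.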
Related papers
- Variable Selection in Convex Piecewise Linear Regression [5.366354612549172]
This paper presents Sparse Gradient Descent (SpGD) as a solution for variable selection in convex piecewise linear regression.
A non-asymptotic local convergence analysis is provided for SpGD under sub-Gaussian noise.
arXiv Detail & Related papers (2024-11-04T16:19:09Z) - Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until they nearly interpolate the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, random feature ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions [11.222970035173372]
We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}}\left(t^{\frac{d}{2}} \vee 1\right)\right)$.
This yields an $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption.
arXiv Detail & Related papers (2024-02-23T20:51:31Z) - $L^1$ Estimation: On the Optimality of Linear Estimators [64.76492306585168]
This work shows that the only prior distribution on $X$ that induces linearity in the conditional median is Gaussian.
In particular, it is demonstrated that if the conditional distribution $P_{X|Y=y}$ is symmetric for all $y$, then $X$ must follow a Gaussian distribution.
arXiv Detail & Related papers (2023-09-17T01:45:13Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z) - Failure and success of the spectral bias prediction for Kernel Ridge
Regression: the case of low-dimensional data [0.28647133890966986]
In some regimes, they predict that the method has a 'spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, the components associated with the largest eigenvalues are fitted best.
This prediction works very well on benchmark data sets such as images, yet the assumptions these approaches make on the data are never satisfied in practice.
arXiv Detail & Related papers (2022-02-07T16:48:14Z) - Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP.
We establish asymptotic normality for the averaged iterate $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
arXiv Detail & Related papers (2021-12-29T14:47:56Z) - Generalization error of random features and kernel methods:
hypercontractivity and kernel matrix concentration [19.78800773518545]
We study the use of random features methods in conjunction with ridge regression in the feature space $\mathbb{R}^N$.
This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.
arXiv Detail & Related papers (2021-01-26T06:46:41Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression, where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Sharp Statistical Guarantees for Adversarially Robust Gaussian
Classification [54.22421582955454]
We provide the first optimal minimax guarantees on the excess risk for adversarially robust classification.
Results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting.
arXiv Detail & Related papers (2020-06-29T21:06:52Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.