Near-Interpolators: Rapid Norm Growth and the Trade-Off between
Interpolation and Generalization
- URL: http://arxiv.org/abs/2403.07264v1
- Date: Tue, 12 Mar 2024 02:47:00 GMT
- Title: Near-Interpolators: Rapid Norm Growth and the Trade-Off between
Interpolation and Generalization
- Authors: Yutong Wang, Rishi Sonthalia, Wei Hu
- Abstract summary: We study the generalization capability of nearly-interpolating linear regressors.
For $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_{2}^{2}] = \Omega(n^{\alpha})$.
We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
- Score: 28.02367842438021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the generalization capability of nearly-interpolating linear
regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but
small, i.e., below the noise floor. Under a random matrix theoretic assumption
on the data distribution and an eigendecay assumption on the data covariance
matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator
exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared
$\ell_2$-norm $\mathbb{E}[\|{\boldsymbol{\beta}}\|_{2}^{2}] =
\Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the
exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim
i^{-\alpha}$. This implies that existing data-independent norm-based bounds are
necessarily loose. On the other hand, in the same regime we precisely
characterize the asymptotic trade-off between interpolation and generalization.
Our characterization reveals that larger norm scaling exponents $\alpha$
correspond to worse trade-offs between interpolation and generalization. We
verify empirically that a similar phenomenon holds for nearly-interpolating
shallow neural networks.
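
The eigendecay and norm-growth statements above can be checked numerically. The following is a minimal sketch (not the authors' code) that draws Gaussian data whose covariance has power-law spectrum $\lambda_i \sim i^{-\alpha}$, tunes a ridge penalty so the training error sits at a fixed level $\tau$ below the unit noise floor, and reports how the squared $\ell_2$-norm of the resulting near-interpolator grows with $n$. The helper name and all numeric choices ($d = 4000$, $\alpha = 2$, $\tau = 0.5$, the isotropic signal, the bisection bracket) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def near_interpolator_norm(n, d, alpha, tau, rng):
    """Tune a ridge regressor so its training MSE is approximately tau,
    then return the squared l2-norm of its coefficient vector."""
    # Power-law eigendecay covariance: lambda_i ~ i^{-alpha} (d > n, so exact
    # interpolation is possible and any training error tau > 0 is attainable).
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows ~ N(0, Sigma)
    beta_star = rng.standard_normal(d) / np.sqrt(d)   # illustrative signal
    y = X @ beta_star + rng.standard_normal(n)        # unit-variance noise

    # One SVD of X lets us evaluate every ridge solution cheaply.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    z = U.T @ y

    def train_mse(reg):
        return np.sum((reg / (s**2 + reg) * z) ** 2) / n

    def norm_sq(reg):
        return np.sum((s * z / (s**2 + reg)) ** 2)

    # Bisect on log(reg): the training error increases with the ridge penalty.
    lo, hi = -14.0, 6.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if train_mse(10.0**mid) > tau:
            hi = mid
        else:
            lo = mid
    return norm_sq(10.0 ** (0.5 * (lo + hi)))

rng = np.random.default_rng(0)
alpha, tau, d = 2.0, 0.5, 4000          # tau sits below the unit noise floor
for n in (100, 200, 400, 800):
    norm_sq_n = near_interpolator_norm(n, d, alpha, tau, rng)
    print(f"n={n:4d}  ||beta_tau||_2^2 = {norm_sq_n:.1f}")
```

With $\alpha = 2$ the stated lower bound is $\Omega(n^{2})$, so the printed norms should grow rapidly, at least on the order of $n^{\alpha}$, as $n$ increases.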
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Universality of max-margin classifiers [10.797131009370219]
We study the role of featurization maps and the high-dimensional universality of the misclassification error for non-Gaussian features.
In particular, the overparametrization threshold and generalization error can be computed within a simpler model.
arXiv Detail & Related papers (2023-09-29T22:45:56Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n = \mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n = \exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where $\boldsymbol{x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels.
We study empirical risk universality over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - On the robustness of the minimum $\ell_2$ interpolator [2.918940961856197]
We analyse the interpolator $\hat{\beta}$ with minimal $\ell_2$-norm in a general high-dimensional linear regression framework.
We prove that, with high probability, the prediction loss of this estimator is bounded from above by $(\|\beta^*\|_2^2 \, r_{c_n}(\Sigma) \vee \|\xi\|^2)/n$, where $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$ are the tails of the sum of the eigenvalues of $\Sigma$ (a minimal numerical sketch of these quantities appears after this list).
arXiv Detail & Related papers (2020-03-12T15:12:28Z) - The generalization error of max-margin linear classifiers: Benign
overfitting and high dimensional asymptotics in the overparametrized regime [11.252856459394854]
Modern machine learning classifiers often exhibit vanishing classification error on the training set.
Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data.
arXiv Detail & Related papers (2019-11-05T00:15:27Z)
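
For the minimum $\ell_2$-norm interpolator in the entry above, which is the $\tau \to 0$ limit of the near-interpolators studied in the main paper, the sketch below computes the two ingredients of the quoted bound: the pseudoinverse interpolator $\hat{\beta} = X^{+} y$ and the eigenvalue tails $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$. The power-law covariance, the dimensions, and the cut-off choice $k = n$ (standing in for the paper's $c_n$) are illustrative assumptions rather than the cited paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 200, 2000, 2.0

# Power-law covariance spectrum, matching the eigendecay assumption above.
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
tail_sums = np.cumsum(lam[::-1])[::-1]            # tail_sums[k-1] = r_k(Sigma)

X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows ~ N(0, Sigma)
beta_star = rng.standard_normal(d) / np.sqrt(d)   # illustrative signal
xi = rng.standard_normal(n)                       # noise vector
y = X @ beta_star + xi

# Minimum l2-norm interpolator: beta_hat = X^+ y (fits the data exactly when d > n).
beta_hat = np.linalg.pinv(X) @ y
assert np.allclose(X @ beta_hat, y)

# Quantity of the form (||beta*||_2^2 * r_k(Sigma)) v (||xi||^2), divided by n,
# with the illustrative cut-off k = n standing in for c_n.
k = n
bound_like = max(np.sum(beta_star**2) * tail_sums[k - 1], np.sum(xi**2)) / n

# Prediction loss (beta_hat - beta*)^T Sigma (beta_hat - beta*) for the diagonal Sigma.
diff = beta_hat - beta_star
pred_loss = diff @ (lam * diff)

print(f"prediction loss     = {pred_loss:.4f}")
print(f"bound-like quantity = {bound_like:.4f}")
```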