Near-Interpolators: Rapid Norm Growth and the Trade-Off between
Interpolation and Generalization
- URL: http://arxiv.org/abs/2403.07264v1
- Date: Tue, 12 Mar 2024 02:47:00 GMT
- Title: Near-Interpolators: Rapid Norm Growth and the Trade-Off between
Interpolation and Generalization
- Authors: Yutong Wang, Rishi Sonthalia, Wei Hu
- Abstract summary: We study the generalization capability of nearly-interpolating linear regressors.
For $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_{2}^{2}] = \Omega(n^{\alpha})$.
We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
- Score: 28.02367842438021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the generalization capability of nearly-interpolating linear
regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but
small, i.e., below the noise floor. Under a random matrix theoretic assumption
on the data distribution and an eigendecay assumption on the data covariance
matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator
exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared
$\ell_2$-norm $\mathbb{E}[\|{\boldsymbol{\beta}}\|_{2}^{2}] =
\Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the
exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim
i^{-\alpha}$. This implies that existing data-independent norm-based bounds are
necessarily loose. On the other hand, in the same regime we precisely
characterize the asymptotic trade-off between interpolation and generalization.
Our characterization reveals that larger norm scaling exponents $\alpha$
correspond to worse trade-offs between interpolation and generalization. We
verify empirically that a similar phenomenon holds for nearly-interpolating
shallow neural networks.
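
The eigendecay and norm-growth statements above can be checked numerically. The following is a minimal sketch (not the authors' code) that draws Gaussian data whose covariance has power-law spectrum $\lambda_i \sim i^{-\alpha}$, tunes a ridge penalty so the training error sits at a fixed level $\tau$ below the unit noise floor, and reports how the squared $\ell_2$-norm of the resulting near-interpolator grows with $n$. The helper name and all numeric choices ($d = 4000$, $\alpha = 2$, $\tau = 0.5$, the isotropic signal, the bisection bracket) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def near_interpolator_norm(n, d, alpha, tau, rng):
    """Tune a ridge regressor so its training MSE is approximately tau,
    then return the squared l2-norm of its coefficient vector."""
    # Power-law eigendecay covariance: lambda_i ~ i^{-alpha} (d > n, so exact
    # interpolation is possible and any training error tau > 0 is attainable).
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows ~ N(0, Sigma)
    beta_star = rng.standard_normal(d) / np.sqrt(d)   # illustrative signal
    y = X @ beta_star + rng.standard_normal(n)        # unit-variance noise

    # One SVD of X lets us evaluate every ridge solution cheaply.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    z = U.T @ y

    def train_mse(reg):
        return np.sum((reg / (s**2 + reg) * z) ** 2) / n

    def norm_sq(reg):
        return np.sum((s * z / (s**2 + reg)) ** 2)

    # Bisect on log(reg): the training error increases with the ridge penalty.
    lo, hi = -14.0, 6.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if train_mse(10.0**mid) > tau:
            hi = mid
        else:
            lo = mid
    return norm_sq(10.0 ** (0.5 * (lo + hi)))

rng = np.random.default_rng(0)
alpha, tau, d = 2.0, 0.5, 4000          # tau sits below the unit noise floor
for n in (100, 200, 400, 800):
    norm_sq_n = near_interpolator_norm(n, d, alpha, tau, rng)
    print(f"n={n:4d}  ||beta_tau||_2^2 = {norm_sq_n:.1f}")
```

With $\alpha = 2$ the stated lower bound is $\Omega(n^{2})$, so the printed norms should grow rapidly, at least on the order of $n^{\alpha}$, as $n$ increases.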
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Universality of max-margin classifiers [10.797131009370219]
We study the role of featurization maps and the high-dimensional universality of the misclassification error for non-Gaussian features.
In particular, the overparametrization threshold and generalization error can be computed within a simpler model.
arXiv Detail & Related papers (2023-09-29T22:45:56Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n = \mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n = \exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where $\boldsymbol{x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels.
We study empirical risk universality over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - On the robustness of the minimum $\ell_2$ interpolator [2.918940961856197]
We analyse the interpolator $\hat{\beta}$ with minimal $\ell_2$-norm in a general high-dimensional linear regression framework.
We prove that, with high probability, the prediction loss of this estimator is bounded from above by $(\|\beta^*\|_2^2 \, r_{c_n}(\Sigma) \vee \|\xi\|^2)/n$, where $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$ are the tails of the sum of the eigenvalues of $\Sigma$ (a minimal numerical sketch of these quantities appears after this list).
arXiv Detail & Related papers (2020-03-12T15:12:28Z) - The generalization error of max-margin linear classifiers: Benign
overfitting and high dimensional asymptotics in the overparametrized regime [11.252856459394854]
Modern machine learning classifiers often exhibit vanishing classification error on the training set.
Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data.
arXiv Detail & Related papers (2019-11-05T00:15:27Z)
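
For the minimum $\ell_2$-norm interpolator in the entry above, which is the $\tau \to 0$ limit of the near-interpolators studied in the main paper, the sketch below computes the two ingredients of the quoted bound: the pseudoinverse interpolator $\hat{\beta} = X^{+} y$ and the eigenvalue tails $r_k(\Sigma) = \sum_{i \geq k} \lambda_i(\Sigma)$. The power-law covariance, the dimensions, and the cut-off choice $k = n$ (standing in for the paper's $c_n$) are illustrative assumptions rather than the cited paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 200, 2000, 2.0

# Power-law covariance spectrum, matching the eigendecay assumption above.
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
tail_sums = np.cumsum(lam[::-1])[::-1]            # tail_sums[k-1] = r_k(Sigma)

X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows ~ N(0, Sigma)
beta_star = rng.standard_normal(d) / np.sqrt(d)   # illustrative signal
xi = rng.standard_normal(n)                       # noise vector
y = X @ beta_star + xi

# Minimum l2-norm interpolator: beta_hat = X^+ y (fits the data exactly when d > n).
beta_hat = np.linalg.pinv(X) @ y
assert np.allclose(X @ beta_hat, y)

# Quantity of the form (||beta*||_2^2 * r_k(Sigma)) v (||xi||^2), divided by n,
# with the illustrative cut-off k = n standing in for c_n.
k = n
bound_like = max(np.sum(beta_star**2) * tail_sums[k - 1], np.sum(xi**2)) / n

# Prediction loss (beta_hat - beta*)^T Sigma (beta_hat - beta*) for the diagonal Sigma.
diff = beta_hat - beta_star
pred_loss = diff @ (lam * diff)

print(f"prediction loss     = {pred_loss:.4f}")
print(f"bound-like quantity = {bound_like:.4f}")
```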