The generalization error of max-margin linear classifiers: Benign
overfitting and high dimensional asymptotics in the overparametrized regime
- URL: http://arxiv.org/abs/1911.01544v3
- Date: Wed, 22 Mar 2023 16:53:25 GMT
- Title: The generalization error of max-margin linear classifiers: Benign
overfitting and high dimensional asymptotics in the overparametrized regime
- Authors: Andrea Montanari, Feng Ruan, Youngtak Sohn, Jun Yan
- Abstract summary: Modern machine learning classifiers often exhibit vanishing classification error on the training set.
Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data.
- Score: 11.252856459394854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning classifiers often exhibit vanishing classification
error on the training set. They achieve this by learning nonlinear
representations of the inputs that map the data into linearly separable
classes.
Motivated by these phenomena, we revisit high-dimensional maximum margin
classification for linearly separable data. We consider a stylized setting in
which data $(y_i,{\boldsymbol x}_i)$, $i\le n$ are i.i.d. with ${\boldsymbol
x}_i\sim\mathsf{N}({\boldsymbol 0},{\boldsymbol \Sigma})$ a $p$-dimensional
Gaussian feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution
depends on a linear combination of the covariates $\langle {\boldsymbol
\theta}_*,{\boldsymbol x}_i \rangle$. While the Gaussian model might appear
extremely simplistic, universality arguments can be used to show that the
results derived in this setting also apply to the output of certain nonlinear
featurization maps.
We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to \psi$,
and derive exact expressions for the limiting generalization error. We use this
theory to derive two results of independent interest: $(i)$ Sufficient
conditions on $({\boldsymbol \Sigma},{\boldsymbol \theta}_*)$ for `benign
overfitting' that parallel previously derived conditions in the case of linear
regression; $(ii)$ An asymptotically exact expression for the generalization
error when max-margin classification is used in conjunction with feature
vectors produced by random one-layer neural networks.
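The stylized setting above is easy to probe numerically. The following is a minimal Monte Carlo sketch, not the paper's analytical formulas: it assumes a rank-one spiked covariance ${\boldsymbol \Sigma}$, a logistic conditional label model depending on $\langle {\boldsymbol \theta}_*,{\boldsymbol x}\rangle$, and uses scikit-learn's LinearSVC with a very large C as a stand-in for the hard-margin (max-margin) classifier on separable data.

```python
# Monte Carlo sketch of the stylized setting (assumptions: spiked covariance,
# logistic label model, LinearSVC with large C as a hard-margin proxy).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

psi = 2.0          # overparametrization ratio p/n
n = 400            # training samples
p = int(psi * n)   # ambient dimension

# theta_* aligned with the first coordinate; Sigma = identity plus a spike there.
theta_star = np.zeros(p)
theta_star[0] = 1.0
Sigma_sqrt = np.eye(p)
Sigma_sqrt[0, 0] = 2.0   # the spiked direction has variance 4

def sample(n_samples):
    """Draw (y, x) with x ~ N(0, Sigma) and P(y = +1 | x) = sigmoid(5 <theta_*, x>)."""
    x = rng.standard_normal((n_samples, p)) @ Sigma_sqrt
    prob = 1.0 / (1.0 + np.exp(-5.0 * (x @ theta_star)))
    y = np.where(rng.random(n_samples) < prob, 1, -1)
    return x, y

X_train, y_train = sample(n)
X_test, y_test = sample(20_000)

# Large C approximates the hard-margin (max-margin) SVM; for p > n the
# training set is linearly separable with high probability.
clf = LinearSVC(C=1e6, loss="hinge", dual=True, fit_intercept=False, max_iter=50_000)
clf.fit(X_train, y_train)

print("train error:", np.mean(clf.predict(X_train) != y_train))
print("test error: ", np.mean(clf.predict(X_test) != y_test))
```

Sweeping $\psi = p/n$, or replacing the Gaussian inputs by the output of a random one-layer featurization map as in result $(ii)$, and comparing the simulated error against the paper's limiting expressions would be the natural next step; this sketch only estimates the error by simulation.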
Related papers
- Near-Interpolators: Rapid Norm Growth and the Trade-Off between
Interpolation and Generalization [28.02367842438021]
We study the generalization capability of nearly-interpolating linear regressors.
For $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_2^2]$ that grows with the sample size.
We show empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
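As an illustration of the trade-off this entry describes, the following numpy sketch (an illustrative toy, not the paper's setup) drives a ridge regressor toward interpolation in an overparametrized problem and records how its squared $\ell_2$-norm grows as the training error shrinks.

```python
# Illustrative sketch: as a ridge regressor is pushed toward interpolation
# (lambda -> 0) in an overparametrized problem, its training error shrinks
# while its squared l2-norm grows.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 800                                 # p > n: overparametrized
X = rng.standard_normal((n, p)) / np.sqrt(p)
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for lam in [1e0, 1e-1, 1e-2, 1e-3, 1e-4]:
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    train_mse = np.mean((X @ beta_hat - y) ** 2)
    print(f"lambda={lam:g}  train MSE={train_mse:.4f}  ||beta||^2={beta_hat @ beta_hat:.1f}")
```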
arXiv Detail & Related papers (2024-03-12T02:47:00Z) - Universality of max-margin classifiers [10.797131009370219]
We study the role of featurization maps and the high-dimensional universality of the misclassification error for non-Gaussian features.
In particular, the overparametrization threshold and generalization error can be computed within a simpler model.
arXiv Detail & Related papers (2023-09-29T22:45:56Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
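The short numpy sketch below only illustrates the 1-bit observation model $y_i = \mathrm{sign}(\langle \boldsymbol{a}_i, \mathbf{x}^*\rangle)$ together with a naive linear direction estimate; the paper's uniform recovery guarantees and generative priors are not reproduced here.

```python
# Sketch of 1-bit (sign) observations with Gaussian measurement vectors and a
# naive direction estimate x_hat = (1/m) sum_i y_i a_i, whose expectation is
# proportional to x*/||x*||. Illustration of the observation model only.
import numpy as np

rng = np.random.default_rng(7)
d, m = 200, 4000
x_star = rng.standard_normal(d)
x_star /= np.linalg.norm(x_star)

A = rng.standard_normal((m, d))
y = np.sign(A @ x_star)                 # 1-bit observations

x_hat = (A.T @ y) / m                   # E[x_hat] = sqrt(2/pi) * x_star
cosine = x_hat @ x_star / np.linalg.norm(x_hat)
print(f"cosine similarity with x*: {cosine:.3f}")
```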
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Repeated Observations for Classification [0.2676349883103404]
We study the problem of nonparametric classification with repeated observations.
In the analysis, we investigate particular models such as robust detection by nominal densities, prototype classification, linear transformation, linear classification, and scaling.
arXiv Detail & Related papers (2023-07-19T10:50:36Z) - Dimension free ridge regression [10.434481202633458]
We revisit ridge regression on i.i.d. data, characterizing the bias and variance of ridge regression in terms of the bias and variance of an `equivalent' sequence model.
As a new application, we obtain a completely explicit and sharp characterization of ridge regression for Hilbert covariates with regularly varying spectrum.
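As a numerical companion to the quantities being characterized, here is a small sketch that computes the exact (over the noise) bias and variance of ridge regression for one fixed Gaussian design; it does not implement the paper's `equivalent' sequence-model characterization.

```python
# Exact bias/variance of ridge regression over the noise, for a fixed design.
# beta_hat = (X^T X + lam I)^{-1} X^T y with y = X beta + sigma * noise.
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 300, 150, 0.5, 1.0
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = rng.standard_normal(p) / np.sqrt(p)

# E_noise[beta_hat] = M @ beta, so the squared bias is ||M beta - beta||^2.
M = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X)
bias_sq = np.sum((M @ beta - beta) ** 2)

# Variance: sigma^2 * tr[(X^T X + lam I)^{-1} X^T X (X^T X + lam I)^{-1}]
A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
variance = sigma**2 * np.sum(A**2)

print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```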
arXiv Detail & Related papers (2022-10-16T16:01:05Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
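A toy numpy experiment in the spirit of this entry (model, step size, and target are illustrative assumptions, not the paper's setup): take one gradient step on the first-layer weights of a two-layer ReLU network, then compare ridge regression on the updated features against plain random features.

```python
# One gradient step on the first-layer weights W of a two-layer ReLU network,
# then ridge regression on the features before vs. after the step.
import numpy as np

rng = np.random.default_rng(3)
d, N, n, n_test, lam, eta = 100, 200, 800, 2000, 1e-2, 5.0

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
def target(X):                                  # single-index target
    return np.tanh(2.0 * X @ theta)

X, X_te = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, y_te = target(X), target(X_te)

W = rng.standard_normal((N, d)) / np.sqrt(d)    # first-layer weights
a = rng.choice([-1.0, 1.0], size=N)             # fixed second layer

def features(X, W):
    return np.maximum(X @ W.T, 0.0)             # ReLU features, shape (n, N)

def ridge_test_mse(W):
    Phi, Phi_te = features(X, W), features(X_te, W)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
    return np.mean((Phi_te @ w - y_te) ** 2)

# One gradient step on W for the squared loss of f(x) = a . relu(Wx) / sqrt(N)
Z = X @ W.T                                     # pre-activations, (n, N)
resid = (np.maximum(Z, 0.0) @ a) / np.sqrt(N) - y
grad_W = ((resid[:, None] * (Z > 0) * a[None, :]).T @ X) / (n * np.sqrt(N))
W1 = W - eta * grad_W

print("test MSE, random features:     ", ridge_test_mse(W))
print("test MSE, after one step on W: ", ridge_test_mse(W1))
```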
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
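For orientation, the sketch below fits ordinary probit regression, the $p=2$ special case, with statsmodels; the paper's $p$-generalized model replaces the Gaussian CDF link with a $p$-generalized normal CDF, and its sketching/coreset machinery is not reproduced here.

```python
# Minimal probit regression (p = 2 special case) via maximum likelihood.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(4)
n, d = 5000, 5
X = sm.add_constant(rng.standard_normal((n, d)))   # intercept + d covariates
beta = np.array([0.3, 1.0, -2.0, 0.5, 0.0, 1.5])

# Probit data-generating process: P(y = 1 | x) = Phi(x . beta)
y = (rng.random(n) < norm.cdf(X @ beta)).astype(int)

result = sm.Probit(y, X).fit(disp=0)
print(result.params)                               # MLE of beta
```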
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where $\boldsymbol{x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels.
We study empirical risk universality over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
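A rough sketch of the ternary idea, with the quantization threshold and nonlinearity chosen for illustration rather than taken from the paper: the projection weights live in $\{-1, 0, +1\}$, so computing a feature needs only additions, subtractions, and a comparison.

```python
# Sketch of ternary random features: quantize i.i.d. Gaussian projection
# weights to {-1, 0, +1} and apply a sign nonlinearity. Threshold and
# nonlinearity here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
d, N, n = 64, 512, 1000

W_gauss = rng.standard_normal((N, d))
# Ternarize: keep the sign of large entries, zero out the rest.
threshold = 0.6745                      # ~ median of |N(0,1)|, so about half the entries survive
W_ternary = np.sign(W_gauss) * (np.abs(W_gauss) > threshold)

X = rng.standard_normal((n, d))
features = np.sign(X @ W_ternary.T)     # the projection itself is multiplication-free

# Feature Gram matrix, e.g. for downstream kernel ridge regression.
K = features @ features.T / N
print(K.shape, np.unique(W_ternary))
```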
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Self-training Converts Weak Learners to Strong Learners in Mixture
Models [86.7137362125503]
We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
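A hedged sketch of the pseudolabel-then-retrain pipeline on a two-component Gaussian mixture; sample sizes, the mixture, and the use of scikit-learn's regularized logistic solver (in place of plain gradient descent on the logistic loss) are all illustrative assumptions.

```python
# Self-training on a two-component Gaussian mixture: fit a pseudolabeler on a
# small labeled set by minimizing a regularized logistic loss, pseudolabel a
# large unlabeled pool, then refit on the pseudolabels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
d, n_labeled, n_unlabeled = 50, 100, 10_000
mu = np.ones(d) / np.sqrt(d)            # mixture mean direction

def sample(n):
    y = rng.choice([-1, 1], size=n)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    return X, y

X_lab, y_lab = sample(n_labeled)
X_unlab, _ = sample(n_unlabeled)
X_test, y_test = sample(5000)

# Step 1: pseudolabeler from the small labeled set.
pseudolabeler = LogisticRegression(C=1.0, max_iter=1000).fit(X_lab, y_lab)

# Step 2: pseudolabel the unlabeled pool and retrain on the pseudolabels.
pseudo_y = pseudolabeler.predict(X_unlab)
self_trained = LogisticRegression(C=1.0, max_iter=1000).fit(X_unlab, pseudo_y)

for name, clf in [("pseudolabeler", pseudolabeler), ("self-trained", self_trained)]:
    err = np.mean(clf.predict(X_test) != y_test)
    print(f"{name}: test error = {err:.3f}")
```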
arXiv Detail & Related papers (2021-06-25T17:59:16Z) - Optimal Combination of Linear and Spectral Estimators for Generalized
Linear Models [59.015960528781115]
We show how to optimally combine $\hat{\boldsymbol{x}}^{\rm L}$ and $\hat{\boldsymbol{x}}^{\rm s}$.
In order to establish the limiting distribution of $(\boldsymbol{x}, \hat{\boldsymbol{x}}^{\rm L}, \hat{\boldsymbol{x}}^{\rm s})$, we design and analyze an Approximate Message Passing (AMP) algorithm.
arXiv Detail & Related papers (2020-08-07T18:20:05Z)