Max-Margin Works while Large Margin Fails: Generalization without
Uniform Convergence
- URL: http://arxiv.org/abs/2206.07892v1
- Date: Thu, 16 Jun 2022 02:46:26 GMT
- Title: Max-Margin Works while Large Margin Fails: Generalization without
Uniform Convergence
- Authors: Margalit Glasgow, Colin Wei, Mary Wootters, Tengyu Ma
- Abstract summary: Existing tools rely on uniform convergence (UC), which guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models.
Nagarajan and Kolter show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails.
We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings.
- Score: 55.26547500184405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge in modern machine learning is theoretically understanding
the generalization properties of overparameterized models. Many existing tools
rely on uniform convergence (UC), a property that, when it holds,
guarantees that the test loss will be close to the training loss, uniformly
over a class of candidate models. Nagarajan and Kolter (2019) show that in
certain simple linear and neural-network settings, any uniform convergence
bound will be vacuous, leaving open the question of how to prove generalization
in settings where UC fails. Our main contribution is proving novel
generalization bounds in two such settings, one linear, and one non-linear. We
study the linear classification setting of Nagarajan and Kolter, and a
quadratic ground truth function learned via a two-layer neural network in the
non-linear regime. We prove a new type of margin bound showing that above a
certain signal-to-noise threshold, any near-max-margin classifier will achieve
almost no test loss in these two settings. Our results show that
near-max-margin is important: while any model that achieves at least a $(1 -
\epsilon)$-fraction of the max-margin generalizes well, a classifier achieving
half of the max-margin may fail terribly. We additionally strengthen the UC
impossibility results of Nagarajan and Kolter, proving that one-sided
UC bounds and classical margin bounds will fail on near-max-margin classifiers.
Our analysis provides insight on why memorization can coexist with
generalization: we show that in this challenging regime where generalization
occurs but UC fails, near-max-margin classifiers simultaneously contain some
generalizable components and some overfitting components that memorize the
data. The presence of the overfitting components is enough to preclude UC, but
the near-extremal margin guarantees that sufficient generalizable components
are present.
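The coexistence of generalizing and memorizing components described above can be illustrated numerically. The sketch below is a toy construction in the spirit of the linear signal-plus-noise setting, not the paper's actual proof; the dimensions, constants, and the particular "memorizer" direction are illustrative assumptions. In an overparameterized Gaussian model, both the signal direction and a pure-noise direction built from the training examples separate the training set with positive margin, but only the former generalizes.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w, X, y):
    """Normalized classification margin: min_i y_i <w, x_i> / ||w||."""
    return min(yi * dot(w, xi) for xi, yi in zip(X, y)) / math.sqrt(dot(w, w))

random.seed(0)
n, d = 50, 2000                       # overparameterized regime: d >> n
mu = [5.0] + [0.0] * (d - 1)          # illustrative ground-truth signal direction

# Training data: x_i = y_i * mu + standard Gaussian noise.
y = [random.choice([-1.0, 1.0]) for _ in range(n)]
X = [[yi * m + random.gauss(0.0, 1.0) for m in mu] for yi in y]

# A pure-noise "memorizer": label-weighted sum of the training points with
# the signal coordinate projected out.  It fits the training labels entirely
# through the per-example noise coordinates.
w_mem = [sum(yi * xi[j] for xi, yi in zip(X, y)) for j in range(d)]
c = dot(w_mem, mu) / dot(mu, mu)
w_mem = [wj - c * mj for wj, mj in zip(w_mem, mu)]

def accuracy(w, X_eval, y_eval):
    return sum((dot(w, xi) > 0) == (yi > 0)
               for xi, yi in zip(X_eval, y_eval)) / len(y_eval)

# Fresh test data from the same distribution.
y_test = [random.choice([-1.0, 1.0]) for _ in range(200)]
X_test = [[yi * m + random.gauss(0.0, 1.0) for m in mu] for yi in y_test]

print(margin(mu, X, y) > 0)             # signal direction separates training data
print(margin(w_mem, X, y) > 0)          # so does the pure-noise memorizer
print(accuracy(mu, X_test, y_test))     # near 1.0: the signal generalizes
print(accuracy(w_mem, X_test, y_test))  # near 0.5: the memorizer does not
```

Both directions achieve positive training margin, so no bound that looks only at training fit can distinguish them; the abstract's point is that being *near the maximum* margin forces enough of the signal component to be present.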
Related papers
- Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension [17.485243410774814]
In traditional models of supervised learning, the goal of a learner is to output a hypothesis that is competitive (to within $\epsilon$) of the best fitting concept from some class.
We introduce a smoothed-analysis framework that requires a learner to compete only with the best classifier that is robust to small Gaussian perturbations of the input.
We obtain the first algorithm for agnostically learning intersections of $k$-halfspaces in this setting.
arXiv Detail & Related papers (2024-07-01T04:58:36Z) - Which Features are Learnt by Contrastive Learning? On the Role of
Simplicity Bias in Class Collapse and Feature Suppression [59.97965005675144]
Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision.
We provide the first unified theoretically rigorous framework to determine which features are learnt by CL.
We present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions.
arXiv Detail & Related papers (2023-05-25T23:37:22Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized
Linear Models [33.36787620121057]
We prove a new generalization bound for any class of linear predictors in Gaussian space.
We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021).
We show that applying our bound via localized Gaussian width will generally be sharp for empirical risk minimizers.
arXiv Detail & Related papers (2022-10-21T16:16:55Z) - Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z) - Benign Overfitting in Multiclass Classification: All Roads Lead to
Interpolation [39.02017410837255]
We study benign overfitting in multiclass linear classification.
We consider several training algorithms on separable data.
We derive novel bounds on the accuracy of the minimum-norm interpolating (MNI) classifier.
arXiv Detail & Related papers (2021-06-21T05:34:36Z) - Learning Gaussian Mixtures with Generalised Linear Models: Precise
Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exact asymptotics characterising the estimator in high dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Localization, Convexity, and Star Aggregation [0.0]
Offset Rademacher complexities have been shown to imply sharp, linear-dependent upper bounds for the square loss.
We show that in the statistical setting, the offset bound can be generalized to any loss satisfying certain uniform convexity conditions.
arXiv Detail & Related papers (2021-05-19T00:47:59Z) - Robustifying Binary Classification to Adversarial Perturbation [45.347651499585055]
In this paper we consider the problem of binary classification with adversarial perturbations.
We introduce a generalization to the max-margin classifier which takes into account the power of the adversary in manipulating the data.
Under some mild assumptions on the loss function, we theoretically show that gradient descent converges in direction to the robust max-margin (RM) classifier.
arXiv Detail & Related papers (2020-10-29T07:20:37Z)
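In the non-adversarial special case, the directional convergence of gradient descent to the max-margin classifier (the phenomenon that result extends to the robust setting) can be checked numerically. The sketch below is an illustrative toy, not that paper's RM construction: the dataset, step size, and iteration count are assumptions chosen by hand. It runs gradient descent on the logistic loss for a tiny separable dataset and compares the resulting direction against a max-margin direction found by brute-force grid search over angles.

```python
import math

# Tiny linearly separable 2-D dataset (chosen by hand for illustration).
data = [((3.0, 1.0), 1.0), ((1.0, 3.0), 1.0), ((-1.0, -1.2), -1.0)]

def normalized_margin(w):
    """min_i y_i <w, x_i> / ||w||"""
    return min(y * (w[0] * x[0] + w[1] * x[1])
               for x, y in data) / math.hypot(w[0], w[1])

# Brute-force max-margin direction: grid search over unit vectors.
best_theta = max((i * 2 * math.pi / 100000 for i in range(100000)),
                 key=lambda t: normalized_margin((math.cos(t), math.sin(t))))
w_star = (math.cos(best_theta), math.sin(best_theta))

# Gradient descent on the logistic loss from zero initialization.
w = [0.0, 0.0]
lr = 0.1
for _ in range(20000):
    g = [0.0, 0.0]
    for x, y in data:
        s = y * (w[0] * x[0] + w[1] * x[1])
        p = 1.0 / (1.0 + math.exp(s))     # sigmoid(-s): per-example loss weight
        g[0] -= p * y * x[0]
        g[1] -= p * y * x[1]
    w[0] -= lr * g[0]
    w[1] -= lr * g[1]

# The GD iterate's direction aligns with the max-margin direction.
cosine = (w[0] * w_star[0] + w[1] * w_star[1]) / math.hypot(w[0], w[1])
print(cosine)                              # close to 1.0
print(normalized_margin(w) / normalized_margin(w_star))  # close to 1.0
```

Note that the iterate's norm grows without bound while its direction stabilizes; only the direction, and hence the normalized margin, is meaningful here.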
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.