Distribution of Classification Margins: Are All Data Equal?
- URL: http://arxiv.org/abs/2107.10199v1
- Date: Wed, 21 Jul 2021 16:41:57 GMT
- Title: Distribution of Classification Margins: Are All Data Equal?
- Authors: Andrzej Banburski, Fernanda De La Torre, Nishka Pant, Ishana Shastri,
Tomaso Poggio
- Abstract summary: We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
- Score: 61.16681488656473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent theoretical results show that gradient descent on deep neural networks
under exponential loss functions locally maximizes classification margin, which
is equivalent to minimizing the norm of the weight matrices under margin
constraints. This property of the solution however does not fully characterize
the generalization performance. We motivate theoretically and show empirically
that the area under the curve of the margin distribution on the training set is
in fact a good measure of generalization. We then show that, after data
separation is achieved, it is possible to dynamically reduce the training set
by more than 99% without significant loss of performance. Interestingly, the
resulting subset of "high capacity" features is not consistent across different
training runs, which is consistent with the theoretical claim that all training
points should converge to the same asymptotic margin under SGD and in the
presence of both batch normalization and weight decay.
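As a rough illustration of the quantity the abstract refers to, the sketch below computes per-example classification margins (true-class score minus the best competing score) and the area under the sorted-margin curve. It is a minimal example under simplifying assumptions, not the authors' code: the normalization of the network output used in the paper is omitted, and the function names and toy data are purely illustrative.

```python
import numpy as np

def classification_margins(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-example margin: score of the true class minus the best competing score."""
    n = logits.shape[0]
    true_scores = logits[np.arange(n), labels]
    competitors = logits.copy()
    competitors[np.arange(n), labels] = -np.inf   # mask out the true class
    return true_scores - competitors.max(axis=1)

def margin_curve_auc(logits: np.ndarray, labels: np.ndarray) -> float:
    """Area under the curve of sorted margins plotted against the fraction of the
    training set; on a uniform grid this reduces to the mean margin."""
    margins = np.sort(classification_margins(logits, labels))
    return float(margins.mean())

# Toy usage: 5 training points, 3 classes, random logits standing in for a network.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
labels = np.array([0, 2, 1, 0, 2])
print(margin_curve_auc(logits, labels))
```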
Related papers
- Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks [9.948870430491738]
We study the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD)
Our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
arXiv Detail & Related papers (2024-10-03T03:36:18Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze the closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z)
- Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training [14.871738070617491]
We show that inconsistency is a more reliable indicator of generalization gap than the sharpness of the loss landscape.
The results also provide a theoretical basis for existing methods such as co-distillation and ensemble.
arXiv Detail & Related papers (2023-05-31T20:28:13Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Measuring Generalization with Optimal Transport [111.29415509046886]
We develop margin-based generalization bounds, where the margins are normalized with optimal transport costs.
Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets.
arXiv Detail & Related papers (2021-06-07T03:04:59Z)
- Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets [12.461503242570643]
Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble.
For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion (a brief sketch follows this entry).
Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high.
arXiv Detail & Related papers (2021-05-14T17:50:14Z)
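The entry above describes fitting a Dirichlet distribution to the ensemble members' output distributions by maximum likelihood. The sketch below is a hedged illustration of that criterion, not the paper's implementation: the softplus parameterization of the concentration parameters, the toy student head, and the random stand-in data are assumptions made only for this example.

```python
import torch
from torch.distributions import Dirichlet

n_classes, n_members = 10, 5
student = torch.nn.Linear(32, n_classes)   # toy student head standing in for a real network

def endd_nll(features: torch.Tensor, ensemble_probs: torch.Tensor) -> torch.Tensor:
    """features: (batch, 32); ensemble_probs: (batch, n_members, n_classes)."""
    # Positive Dirichlet concentration parameters; softplus is one common choice.
    alpha = torch.nn.functional.softplus(student(features)) + 1e-4
    dist = Dirichlet(alpha)                                    # one Dirichlet per example
    # Broadcast log_prob over the ensemble-member dimension.
    log_lik = dist.log_prob(ensemble_probs.transpose(0, 1))   # (n_members, batch)
    return -log_lik.mean()                                    # maximizing likelihood = minimizing NLL

# Toy usage with random tensors standing in for features and ensemble predictions.
feats = torch.randn(8, 32)
probs = torch.softmax(torch.randn(8, n_members, n_classes), dim=-1)
endd_nll(feats, probs).backward()
```

Proxy targets, which the paper introduces to address the poor convergence at very large numbers of classes, are not modeled in this sketch.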
- Explicit regularization and implicit bias in deep network classifiers trained with the square loss [2.8935588665357077]
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks.
We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques are used together with Weight Decay.
arXiv Detail & Related papers (2020-12-31T21:07:56Z)
- The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
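The max-margin result summarized in this entry can be illustrated numerically. The following sketch is a toy construction, not code from the paper: it runs plain gradient descent on the logistic loss over separable 2D data and compares the resulting weight direction with a large-C linear SVM from scikit-learn, used here as a stand-in for the hard-margin solution. The result predicts that the weight norm grows without bound while the cosine similarity tends to 1.

```python
import numpy as np
from scipy.special import expit
from sklearn.svm import SVC

# Two well-separated Gaussian blobs in 2D, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

# Plain gradient descent on the unregularized logistic loss with a linear predictor w.
w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    grad = -(expit(-y * (X @ w))[:, None] * y[:, None] * X).mean(axis=0)
    w -= lr * grad

# Large-C linear SVM as a stand-in for the hard-margin (max-margin) solution.
svm_dir = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
cosine = w @ svm_dir / (np.linalg.norm(w) * np.linalg.norm(svm_dir))
print(f"|w| = {np.linalg.norm(w):.1f}, cosine(w, SVM direction) = {cosine:.4f}")
```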
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.