Training shallow ReLU networks on noisy data using hinge loss: when do
we overfit and is it benign?
- URL: http://arxiv.org/abs/2306.09955v2
- Date: Wed, 8 Nov 2023 19:02:56 GMT
- Title: Training shallow ReLU networks on noisy data using hinge loss: when do
we overfit and is it benign?
- Authors: Erin George, Michael Murray, William Swartworth, Deanna Needell
- Abstract summary: We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification.
We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study benign overfitting in two-layer ReLU networks trained using gradient
descent and hinge loss on noisy data for binary classification. In particular,
we consider linearly separable data for which a relatively small proportion of
labels are corrupted or flipped. We identify conditions on the margin of the
clean data that give rise to three distinct training outcomes: benign
overfitting, in which zero loss is achieved and with high probability test data
is classified correctly; overfitting, in which zero loss is achieved but test
data is misclassified with probability lower bounded by a constant; and
non-overfitting, in which clean points, but not corrupt points, achieve zero
loss and again with high probability test data is classified correctly. Our
analysis provides a fine-grained description of the dynamics of neurons
throughout training and reveals two distinct phases: in the first phase clean
points achieve close to zero loss, in the second phase clean points oscillate
on the boundary of zero loss while corrupt points either converge towards zero
loss or are eventually zeroed by the network. We prove these results using a
combinatorial approach that involves bounding the number of clean versus
corrupt updates across these phases of training.
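The setting described in the abstract can be illustrated with a minimal sketch: a two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x) with the second layer fixed, trained by gradient descent on the hinge loss max(0, 1 - y f(x)) over linearly separable data with a small fraction of flipped labels. All dimensions, widths, step sizes, and the noise rate below are illustrative choices, not the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: linearly separable 2-D data, a small fraction of flipped labels.
n, d, m = 200, 2, 50            # samples, input dimension, hidden width
flip_frac = 0.05                # proportion of corrupted labels

X = rng.normal(size=(n, d))
w_star = np.array([1.0, 1.0])   # ground-truth separating direction
y = np.sign(X @ w_star)
y[y == 0] = 1.0
flip = rng.random(n) < flip_frac
y[flip] *= -1                   # corrupt a small fraction of labels

# Two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x); second layer fixed.
W = rng.normal(scale=0.1, size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def forward(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a

def hinge_loss(margins):
    return np.maximum(0.0, 1.0 - margins).mean()

init_loss = hinge_loss(y * forward(X, W, a))

lr = 0.1
for step in range(2000):
    h = np.maximum(X @ W.T, 0.0)             # (n, m) hidden activations
    margins = y * (h @ a)
    active = (margins < 1.0).astype(float)   # only points inside the margin update W
    relu_grad = (X @ W.T > 0).astype(float)  # ReLU subgradient (0 at the kink)
    # Gradient of the mean hinge loss with respect to the first-layer weights.
    G = -((active * y)[:, None] * relu_grad * a[None, :]).T @ X / n
    W -= lr * G

final_loss = hinge_loss(y * forward(X, W, a))
clean_loss = hinge_loss((y * forward(X, W, a))[~flip])
```

Note how only points with margin below 1 contribute updates; counting how many such updates come from clean versus corrupt points is the combinatorial quantity the paper's analysis tracks.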
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from
KKT Conditions for Margin Maximization [59.038366742773164]
Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards satisfying the Karush-Kuhn-Tucker (KKT) conditions for margin maximization.
In this work we establish a number of settings where the satisfaction of these conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks.
arXiv Detail & Related papers (2023-03-02T18:24:26Z) - The perils of being unhinged: On the accuracy of classifiers minimizing
a noise-robust convex loss [12.132641563193584]
van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense.
In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.
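The unhinged loss of van Rooyen et al. is linear, l(v) = 1 - v, so over unit-norm linear classifiers its empirical minimizer has a closed form: the normalized class-mean direction mu / ||mu|| with mu = mean(y_i * x_i). The dataset below is a constructed example (not from the note itself) where that minimizer misclassifies exactly half the training points even though a perfect linear separator exists:

```python
import numpy as np

# Separable data: w = (0, 1) classifies every point correctly,
# but a large first coordinate skews the class-mean direction.
X = np.array([[10.0, 1.0], [1.0, -1.0]] * 50)
y = np.array([1.0, -1.0] * 50)

# Mean unhinged loss is 1 - w . mu, so its minimizer on the unit
# ball is the normalized class-mean direction mu / ||mu||.
mu = (y[:, None] * X).mean(axis=0)
w = mu / np.linalg.norm(mu)

acc_unhinged = (np.sign(X @ w) == y).mean()      # chance-level accuracy

w_sep = np.array([0.0, 1.0])                     # a perfect separator exists
acc_sep = (np.sign(X @ w_sep) == y).mean()
```

Here mu = (4.5, 1): every point gets a positive score, so all negative examples are misclassified, matching the note's observation that unhinged-loss minimization can be no better than random guessing on separable data.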
arXiv Detail & Related papers (2021-12-08T20:57:20Z) - Understanding Square Loss in Training Overparametrized Neural Network
Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether the classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be bounded away from zero, providing theoretical guarantees for robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z) - Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization objective helps the network achieve better accuracy.
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
arXiv Detail & Related papers (2021-09-12T23:14:06Z) - Sample Selection with Uncertainty of Losses for Learning with Noisy
Labels [145.06552420999986]
In learning with noisy labels, the sample selection approach, which regards small-loss data as correctly labeled during training, is very popular.
However, losses are generated on the fly by a model being trained on noisy labels, so large-loss data are likely, but not certain, to be incorrectly labeled.
In this paper, we incorporate the uncertainty of losses by adopting interval estimation instead of point estimation of losses.
arXiv Detail & Related papers (2021-06-01T12:53:53Z) - Learning from Noisy Labels via Dynamic Loss Thresholding [69.61904305229446]
We propose a novel method named Dynamic Loss Thresholding (DLT).
During the training process, DLT records the loss value of each sample and calculates dynamic loss thresholds.
Experiments on CIFAR-10/100 and Clothing1M demonstrate substantial improvements over recent state-of-the-art methods.
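The summary above does not specify DLT's exact thresholding rule. As a hedged sketch of the general small-loss idea it builds on, one could keep samples whose current loss falls below a threshold computed dynamically from recently recorded losses; the quantile-based rule here is a hypothetical stand-in, not the paper's method:

```python
import numpy as np

def select_small_loss(losses, history, quantile=0.7, window=5):
    """Keep samples whose current loss falls below a dynamic threshold.

    Records per-sample losses each epoch and sets the threshold to a
    quantile of the last `window` epochs' losses. The quantile rule is
    a hypothetical stand-in for DLT's actual thresholding scheme.
    """
    history.append(np.asarray(losses, dtype=float))
    recent = np.concatenate(history[-window:])
    threshold = np.quantile(recent, quantile)
    kept = np.flatnonzero(np.asarray(losses) < threshold)
    return kept, threshold

# Toy usage: noisy samples tend to sit at high loss and get filtered out.
history = []
epoch1 = np.array([0.1, 0.2, 2.5, 0.3, 3.0])
idx1, t1 = select_small_loss(epoch1, history)
```

Because the threshold is recomputed from the evolving loss history rather than fixed in advance, the selected subset adapts as training progresses, which is the behavior the DLT summary describes.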
arXiv Detail & Related papers (2021-04-01T07:59:03Z) - When does gradient descent with logistic loss find interpolating
two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - Robust binary classification with the 01 loss [0.0]
We develop a coordinate descent algorithm for a linear 01 loss classifier and a single-hidden-layer 01 loss neural network.
We show that our algorithms are fast and comparable in accuracy to the linear support vector machine and the logistic-loss single-hidden-layer network for binary classification.
arXiv Detail & Related papers (2020-02-09T20:41:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.