When and how epochwise double descent happens
- URL: http://arxiv.org/abs/2108.12006v1
- Date: Thu, 26 Aug 2021 19:19:17 GMT
- Title: When and how epochwise double descent happens
- Authors: Cory Stephenson, Tyler Lee
- Abstract summary: An `epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
- Score: 7.512375012141203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks are known to exhibit a `double descent' behavior as the
number of parameters increases. Recently, it has also been shown that an
`epochwise double descent' effect exists in which the generalization error
initially drops, then rises, and finally drops again with increasing training
time. This presents a practical problem in that the amount of time required for
training is long, and early stopping based on validation performance may result
in suboptimal generalization. In this work we develop an analytically tractable
model of epochwise double descent that allows us to characterise theoretically
when this effect is likely to occur. This model is based on the hypothesis that
the training data contains features that are slow to learn but informative. We
then show experimentally that deep neural networks behave similarly to our
theoretical model. Our findings indicate that epochwise double descent requires
a critical amount of noise to occur, but above a second critical noise level
early stopping remains effective. Using insights from theory, we give two
methods by which epochwise double descent can be removed: one that removes slow
to learn features from the input and reduces generalization performance, and
another that instead modifies the training dynamics and matches or exceeds the
generalization performance of standard training. Taken together, our results
suggest a new picture of how epochwise double descent emerges from the
interplay between the dynamics of training and noise in the training data.
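To make the effect concrete, the following is a minimal, self-contained sketch (in PyTorch, and not the authors' experimental setup) of how epochwise double descent is typically observed: train on data with a fraction of flipped labels, record train/validation/test error every epoch, and note where validation-based early stopping would halt. All names and hyperparameters here (noise_rate, the MLP width, the number of epochs) are illustrative assumptions.

```python
# Minimal sketch (not the paper's setup): watch how test error evolves epoch
# by epoch when a fraction of training labels is flipped, and where naive
# validation-based early stopping would halt. All hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test, noise_rate = 20, 500, 2000, 0.2

w_true = torch.randn(d)

def make_data(n, flip_frac=0.0):
    X = torch.randn(n, d)
    y = (X @ w_true > 0).long()              # clean binary labels
    flip = torch.rand(n) < flip_frac         # label-noise mask
    return X, torch.where(flip, 1 - y, y)

X_tr, y_tr = make_data(n_train, noise_rate)  # noisy training set
X_va, y_va = make_data(200, noise_rate)      # validation drawn like training
X_te, y_te = make_data(n_test, 0.0)          # clean test set

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

def err(X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) != y).float().mean().item()

best_va, best_epoch = float("inf"), 0
for epoch in range(1, 2001):
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()
    va, te = err(X_va, y_va), err(X_te, y_te)
    if va < best_va:                          # naive early-stopping tracker
        best_va, best_epoch = va, epoch
    if epoch % 100 == 0:
        print(f"epoch {epoch:4d}  train {err(X_tr, y_tr):.3f}  "
              f"val {va:.3f}  test {te:.3f}")

print(f"validation-based early stopping would pick epoch {best_epoch}")
```

With enough label noise, the logged test error can dip, rise, and dip again over training; whether the validation-chosen epoch lands before or after the second descent is exactly the early-stopping question the abstract raises.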
Related papers
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose NMTune, a light-weight black-box tuning method that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Benign Overfitting in Two-layer Convolutional Neural Networks [90.75603889605043]
We study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN).
We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss.
On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss.
arXiv Detail & Related papers (2022-02-14T07:45:51Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels [1.4424394176890545]
This work confirms that double descent occurs with small datasets and noisy labels.
We show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it.
We show that these findings translate to a real-world application: the forecast of events in epileptic patients from continuous electroencephalographic recordings.
arXiv Detail & Related papers (2021-06-03T19:41:40Z)
- Early Stopping in Deep Networks: Double Descent and How to Eliminate it [30.61588337557343]
We show that epoch-wise double descent arises because different parts of the network are learned at different epochs.
We study two standard convolutional networks empirically and show that eliminating epoch-wise double descent by adjusting the step sizes of different layers significantly improves early stopping performance (see the per-layer step-size sketch after this list).
arXiv Detail & Related papers (2020-07-20T13:43:33Z)
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle we call Feature Purification: one cause of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
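As referenced in the "Early Stopping in Deep Networks" entry above, one proposed remedy is to give different layers different step sizes. The sketch below shows only the generic mechanism for doing that with optimizer parameter groups; the layer split, learning rates, and architecture are placeholders and not the cited paper's recipe.

```python
# Hedged sketch of per-layer step sizes via optimizer parameter groups.
# This illustrates the general mechanism only; the specific layer split and
# learning rates are placeholders, not the cited paper's recipe.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)

# Early (convolutional) layers get a smaller step size than the final
# classifier layer; tuning this split is the kind of intervention the
# cited work studies.
conv_params = [p for m in model[:4] for p in m.parameters()]
head_params = list(model[-1].parameters())

optimizer = torch.optim.SGD(
    [
        {"params": conv_params, "lr": 0.01},  # slower step size for features
        {"params": head_params, "lr": 0.1},   # faster step size for the head
    ],
    momentum=0.9,
)

# One illustrative update on random data.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Parameter groups keep the whole network in a single optimizer while letting the feature-extracting layers move more slowly than the classifier head.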
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.