Double Descent Optimization Pattern and Aliasing: Caveats of Noisy
Labels
- URL: http://arxiv.org/abs/2106.02100v1
- Date: Thu, 3 Jun 2021 19:41:40 GMT
- Title: Double Descent Optimization Pattern and Aliasing: Caveats of Noisy
Labels
- Authors: Florian Dubost, Khaled Kamal Saab, Erin Hong, Daniel Yang Fu, Max
Pike, Siddharth Sharma, Siyi Tang, Nandita Bhaskhar, Christopher Lee-Messer,
Daniel Rubin
- Abstract summary: This work confirms that double descent occurs with small datasets and noisy labels.
We show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it.
We show that these findings translate to a real-world application: the forecast of seizure events in epileptic patients from continuous electroencephalographic recordings.
- Score: 1.4424394176890545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimization plays a key role in the training of deep neural networks.
Deciding when to stop training can have a substantial impact on the performance
of the network during inference. Under certain conditions, the generalization
error can display a double descent pattern during training: the learning curve
is non-monotonic and seemingly diverges before converging again after
additional epochs. This optimization pattern can lead to early stopping
procedures to stop training before the second convergence and consequently
select a suboptimal set of parameters for the network, with worse performance
during inference. In this work, in addition to confirming that double descent
occurs with small datasets and noisy labels as evidenced by others, we show
that noisy labels must be present both in the training and generalization sets
to observe a double descent pattern. We also show that the learning rate has an
influence on double descent, and study how different optimizers and optimizer
parameters influence the appearance of double descent. Finally, we show that
increasing the learning rate can create an aliasing effect that masks the
double descent pattern without suppressing it. We study this phenomenon through
extensive experiments on variants of CIFAR-10 and show that they translate to a
real-world application: the forecast of seizure events in epileptic patients
from continuous electroencephalographic recordings.
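A minimal sketch of the kind of experiment the abstract describes, in PyTorch: symmetric label noise is injected into both the CIFAR-10 training set and the held-out set used to measure generalization, and the held-out error is recorded after every epoch so an epoch-wise double descent remains visible. The 20% noise fraction, ResNet-18 architecture, SGD settings, and epoch budget are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: noisy labels in both the training and generalization sets, with the
# held-out error logged every epoch. A non-monotonic curve (down, up, down
# again) is the epoch-wise double descent that can fool early stopping.
# Noise fraction, model, optimizer, and learning rate are assumptions.
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

def corrupt_labels(dataset, noise_fraction, num_classes=10, seed=0):
    """Replace a fraction of labels with uniformly random class indices."""
    rng = np.random.default_rng(seed)
    targets = np.array(dataset.targets)
    flip = rng.random(len(targets)) < noise_fraction
    targets[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    dataset.targets = targets.tolist()
    return dataset

transform = transforms.ToTensor()
train_set = corrupt_labels(
    datasets.CIFAR10("data", train=True, download=True, transform=transform), 0.2)
val_set = corrupt_labels(
    datasets.CIFAR10("data", train=False, download=True, transform=transform), 0.2)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

val_errors = []
for epoch in range(200):  # train far past the first minimum to expose the second descent
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()
    wrong = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            wrong += (model(x).argmax(dim=1) != y).sum().item()
            total += y.numel()
    val_errors.append(wrong / total)
    print(f"epoch {epoch:3d}  noisy held-out error {val_errors[-1]:.3f}")
```

Stopping on the first plateau of `val_errors` with a short patience would return the pre-second-descent weights; inspecting the full curve is what reveals whether a later, better minimum exists.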
Related papers
- The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers [8.770864706004472]
We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE).
This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect.
arXiv Detail & Related papers (2024-10-14T00:51:21Z) - Understanding the Role of Optimization in Double Descent [8.010193718024347]
We propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all.
To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent are unified from the viewpoint of optimization.
Our results suggest that double descent is unlikely to be a problem in real-world machine learning setups.
arXiv Detail & Related papers (2023-12-06T23:29:00Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We characterize exactly when and where double descent appears, and show that its location is not inherently tied to the threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Deep Double Descent via Smooth Interpolation [2.141079906482723]
We quantify the sharpness of fit to the training data by studying the loss landscape with respect to the input variable in the neighborhood of each training point.
Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy targets.
While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, in contrast to existing intuition.
arXiv Detail & Related papers (2022-09-21T02:46:13Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists, in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the staleness of the updates.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Early Stopping in Deep Networks: Double Descent and How to Eliminate it [30.61588337557343]
We show that epoch-wise double descent arises because different parts of the network are learned at different epochs.
We study two standard convolutional networks empirically and show that eliminating epoch-wise double descent by adjusting the stepsizes of different layers significantly improves early stopping performance (a parameter-group sketch of layer-wise stepsizes appears after this list).
arXiv Detail & Related papers (2020-07-20T13:43:33Z) - Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
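The layer-wise stepsize adjustment mentioned in the "Early Stopping in Deep Networks" entry above maps naturally onto optimizer parameter groups. A minimal sketch follows; the toy network, the early/late split, and the two learning-rate values are assumptions for illustration, not the authors' exact scheme.

```python
# Sketch: give different layers different stepsizes via SGD parameter groups,
# the mechanism one would use when different parts of the network are learned
# at different epochs. Grouping and learning rates are illustrative assumptions.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)

early_params = list(model[0].parameters()) + list(model[2].parameters())
late_params = list(model[6].parameters())

optimizer = torch.optim.SGD(
    [
        {"params": early_params, "lr": 0.05},   # larger stepsize for the convolutional layers
        {"params": late_params, "lr": 0.005},   # smaller stepsize for the classifier head
    ],
    momentum=0.9,
)
```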
This list is automatically generated from the titles and abstracts of the papers in this site.