Early Stopping in Deep Networks: Double Descent and How to Eliminate it
- URL: http://arxiv.org/abs/2007.10099v2
- Date: Sat, 19 Sep 2020 22:21:12 GMT
- Title: Early Stopping in Deep Networks: Double Descent and How to Eliminate it
- Authors: Reinhard Heckel and Fatih Furkan Yilmaz
- Abstract summary: We show that epoch-wise double descent arises because different parts of the network are learned at different epochs.
We study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
- Score: 30.61588337557343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over-parameterized models, such as large deep networks, often exhibit a
double descent phenomenon, where, as a function of model size, the error first
decreases, then increases, and finally decreases again. This intriguing double descent
behavior also occurs as a function of training epochs and has been conjectured
to arise because training epochs control the model complexity. In this paper,
we show that such epoch-wise double descent arises for a different reason: It
is caused by a superposition of two or more bias-variance tradeoffs that arise
because different parts of the network are learned at different epochs, and
eliminating this by proper scaling of stepsizes can significantly improve the
early stopping performance. We show this analytically for i) linear regression,
where differently scaled features give rise to a superposition of bias-variance
tradeoffs, and for ii) a two-layer neural network, where the first and second
layer each govern a bias-variance tradeoff. Inspired by this theory, we study
two standard convolutional networks empirically and show that eliminating
epoch-wise double descent through adjusting stepsizes of different layers
improves the early stopping performance significantly.
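The remedy the abstract describes, giving different layers different stepsizes so that their individual bias-variance tradeoffs line up in time, can be expressed with per-layer learning rates. Below is a minimal sketch using PyTorch optimizer parameter groups; the two-block architecture and the specific learning-rate values are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): per-layer stepsizes via PyTorch
# optimizer parameter groups. The architecture and the 10x ratio between the
# learning rates are placeholders chosen for illustration only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # "early" part of the network
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 10),                 # "late" part of the network
)

# Assign the two parts their own learning rates so that each reaches its own
# bias-variance sweet spot at roughly the same epoch, which is the kind of
# per-layer rescaling of stepsizes the abstract refers to.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 0.01},
        {"params": model[3].parameters(), "lr": 0.1},
    ],
    momentum=0.9,
)
```

In practice, the ratio between the per-layer rates would be tuned (or derived from the theory for the linear and two-layer cases) rather than fixed a priori as in this sketch.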
Related papers
- Towards understanding epoch-wise double descent in two-layer linear neural networks [11.210628847081097]
We study epoch-wise double descent in two-layer linear neural networks.
We identify additional factors of epoch-wise double descent emerging with the extra model layer.
This opens up for further questions regarding unidentified factors of epoch-wise double descent for truly deep models.
arXiv Detail & Related papers (2024-07-13T10:45:21Z) - Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smoothly interpolating one.
Section 3 explores double descent with two linear models and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We revisit when and where double descent appears, and show that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Learning time-scales in two-layers neural networks [11.878594839685471]
We study the gradient flow dynamics of a wide two-layer neural network in high-dimension.
Based on new rigorous results, we propose a scenario for the learning dynamics in this setting.
arXiv Detail & Related papers (2023-02-28T19:52:26Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Regularization-wise double descent: Why it occurs and how to eliminate
it [30.61588337557343]
We show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength.
We study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer.
arXiv Detail & Related papers (2022-06-03T03:23:58Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)