Regularization-wise double descent: Why it occurs and how to eliminate it
- URL: http://arxiv.org/abs/2206.01378v1
- Date: Fri, 3 Jun 2022 03:23:58 GMT
- Title: Regularization-wise double descent: Why it occurs and how to eliminate it
- Authors: Fatih Furkan Yilmaz, Reinhard Heckel
- Abstract summary: We show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength.
We study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer.
- Score: 30.61588337557343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The risk of overparameterized models, in particular deep neural networks, is
often double-descent shaped as a function of the model size. Recently, it was
shown that the risk as a function of the early-stopping time can also be
double-descent shaped, and this behavior can be explained as a superposition
of bias-variance tradeoffs. In this paper, we show that the risk of explicit
L2-regularized models can exhibit double descent behavior as a function of the
regularization strength, both in theory and practice. We find that for linear
regression, a double descent shaped risk is caused by a superposition of
bias-variance tradeoffs corresponding to different parts of the model and can
be mitigated by scaling the regularization strength of each part appropriately.
Motivated by this result, we study a two-layer neural network and show that
double descent can be eliminated by adjusting the regularization strengths for
the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18
trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and
demonstrate that all exhibit double descent behavior as a function of the
regularization strength.
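
To make the per-part scaling concrete, here is a minimal sketch, assuming a PyTorch setup: giving each layer its own L2 regularization strength amounts to assigning a separate weight-decay coefficient per optimizer parameter group. The architecture, the global strength `lam`, and the scaling factor `alpha` below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the paper's code): per-layer L2 regularization for a
# two-layer network via PyTorch optimizer parameter groups.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 512),  # first layer
    nn.ReLU(),
    nn.Linear(512, 10),   # second layer
)

lam = 1e-3   # global regularization strength (the quantity being swept)
alpha = 0.1  # assumed relative scaling of the second layer's penalty

# `weight_decay` implements an L2 penalty; separate parameter groups give
# each layer its own regularization strength, i.e. the per-part scaling
# the abstract describes.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "weight_decay": lam},
        {"params": model[2].parameters(), "weight_decay": alpha * lam},
    ],
    lr=0.01,
)
```

Sweeping `lam` with `alpha` fixed traces out the risk-versus-regularization curve discussed above. The same parameter-group mechanism also covers the related early-stopping paper listed below, which instead adjusts the stepsizes (`lr`) of different layers.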
Related papers
- Towards understanding epoch-wise double descent in two-layer linear neural networks [11.210628847081097]
We study epoch-wise double descent in two-layer linear neural networks.
We identify additional factors of epoch-wise double descent emerging with the extra model layer.
This opens up further questions regarding unidentified factors of epoch-wise double descent in truly deep models.
arXiv Detail & Related papers (2024-07-13T10:45:21Z) - The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, benign overfitting can still be harmful for adversarial robustness.
Our finding provides theoretical insight into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attack, while benignly overfitted neural networks lead to models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning [68.76846801719095]
We show that the second descent appears exactly when and where a transition between underlying model complexity axes occurs, and that its location is therefore not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization [125.5448293005647]
We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in offline deep RL.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the derived regularizer favors degenerate solutions.
We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer (see the sketch after this list).
arXiv Detail & Related papers (2021-12-09T06:01:01Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Estimation of Bivariate Structural Causal Models by Variational Gaussian Process Regression Under Likelihoods Parametrised by Normalising Flows [74.85071867225533]
Causal mechanisms can be described by structural causal models.
One major drawback of state-of-the-art artificial intelligence is its lack of explainability.
arXiv Detail & Related papers (2021-09-06T14:52:58Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z) - Early Stopping in Deep Networks: Double Descent and How to Eliminate it [30.61588337557343]
We show that epoch-wise double descent arises because different parts of the network are learned at different epochs.
We study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
arXiv Detail & Related papers (2020-07-20T13:43:33Z) - Optimal Regularization Can Mitigate Double Descent [29.414119906479954]
We study whether the double-descent phenomenon can be avoided by using optimal regularization.
We demonstrate empirically that optimally-tuned $\ell_2$ regularization can mitigate double descent for more general models, including neural networks.
arXiv Detail & Related papers (2020-03-04T05:19:09Z)
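
As a postscript to the DR3 entry above: that paper's explicit regularizer penalizes large dot products between the learned features of consecutive state-action pairs in temporal difference learning. Below is a hedged sketch of the idea; the names `phi_sa`, `features`, and the coefficient `c0` are illustrative assumptions, not the paper's exact interface.

```python
# Hedged sketch of a DR3-style explicit regularizer for TD learning.
# It penalizes feature "co-adaptation": large dot products between the
# representations of (s, a) and the next pair (s', a') used in the backup.
import torch

def dr3_penalty(phi_sa: torch.Tensor, phi_next: torch.Tensor) -> torch.Tensor:
    """Mean dot product between features of (s, a) and (s', a')."""
    return (phi_sa * phi_next).sum(dim=-1).mean()

# Example usage inside a TD update (td_loss, features, and c0 are assumed):
# loss = td_loss + c0 * dr3_penalty(features(s, a), features(s_next, a_next))
```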