Regularization-wise double descent: Why it occurs and how to eliminate it
- URL: http://arxiv.org/abs/2206.01378v1
- Date: Fri, 3 Jun 2022 03:23:58 GMT
- Title: Regularization-wise double descent: Why it occurs and how to eliminate it
- Authors: Fatih Furkan Yilmaz, Reinhard Heckel
- Abstract summary: We show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength.
We study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer.
- Score: 30.61588337557343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The risk of overparameterized models, in particular deep neural networks, is
often double-descent shaped as a function of the model size. Recently, it was
shown that the risk as a function of the early-stopping time can also be
double-descent shaped, and this behavior can be explained as a superposition
of bias-variance tradeoffs. In this paper, we show that the risk of explicit
L2-regularized models can exhibit double descent behavior as a function of the
regularization strength, both in theory and practice. We find that for linear
regression, a double descent shaped risk is caused by a superposition of
bias-variance tradeoffs corresponding to different parts of the model and can
be mitigated by scaling the regularization strength of each part appropriately.
Motivated by this result, we study a two-layer neural network and show that
double descent can be eliminated by adjusting the regularization strengths for
the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18
trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and
demonstrate that all exhibit double descent behavior as a function of the
regularization strength.
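
To make the per-part scaling concrete, here is a minimal sketch, assuming a PyTorch setup: giving each layer its own L2 regularization strength amounts to assigning a separate weight-decay coefficient per optimizer parameter group. The architecture, the global strength `lam`, and the scaling factor `alpha` below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the paper's code): per-layer L2 regularization for a
# two-layer network via PyTorch optimizer parameter groups.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 512),  # first layer
    nn.ReLU(),
    nn.Linear(512, 10),   # second layer
)

lam = 1e-3   # global regularization strength (the quantity being swept)
alpha = 0.1  # assumed relative scaling of the second layer's penalty

# `weight_decay` implements an L2 penalty; separate parameter groups give
# each layer its own regularization strength, i.e. the per-part scaling
# the abstract describes.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "weight_decay": lam},
        {"params": model[2].parameters(), "weight_decay": alpha * lam},
    ],
    lr=0.01,
)
```

Sweeping `lam` with `alpha` fixed traces out the risk-versus-regularization curve discussed above. The same parameter-group mechanism also covers the related early-stopping paper listed below, which instead adjusts the stepsizes (`lr`) of different layers.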
Related papers
- Towards understanding epoch-wise double descent in two-layer linear neural networks [11.210628847081097]
We study epoch-wise double descent in two-layer linear neural networks.
We identify additional factors of epoch-wise double descent emerging with the extra model layer.
This opens up further questions regarding unidentified factors of epoch-wise double descent in truly deep models.
arXiv Detail & Related papers (2024-07-13T10:45:21Z) - The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, benign overfitting can still be harmful for adversarial robustness.
Our finding provides theoretical insight into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attack, while benignly overfitted neural networks lead to models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning [68.76846801719095]
We show that the second descent appears exactly when and where a transition between underlying model complexity axes occurs, and that its location is therefore not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization [125.5448293005647]
We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in offline deep RL.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the derived regularizer favors degenerate solutions.
We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer (see the sketch after this list).
arXiv Detail & Related papers (2021-12-09T06:01:01Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Estimation of Bivariate Structural Causal Models by Variational Gaussian Process Regression Under Likelihoods Parametrised by Normalising Flows [74.85071867225533]
Causal mechanisms can be described by structural causal models.
One major drawback of state-of-the-art artificial intelligence is its lack of explainability.
arXiv Detail & Related papers (2021-09-06T14:52:58Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z) - Early Stopping in Deep Networks: Double Descent and How to Eliminate it [30.61588337557343]
We show that epoch-wise double descent arises because different parts of the network are learned at different epochs.
We study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
arXiv Detail & Related papers (2020-07-20T13:43:33Z) - Optimal Regularization Can Mitigate Double Descent [29.414119906479954]
We study whether the double-descent phenomenon can be avoided by using optimal regularization.
We demonstrate empirically that optimally-tuned $\ell_2$ regularization can mitigate double descent for more general models, including neural networks.
arXiv Detail & Related papers (2020-03-04T05:19:09Z)
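
As a postscript to the DR3 entry above: that paper's explicit regularizer penalizes large dot products between the learned features of consecutive state-action pairs in temporal difference learning. Below is a hedged sketch of the idea; the names `phi_sa`, `features`, and the coefficient `c0` are illustrative assumptions, not the paper's exact interface.

```python
# Hedged sketch of a DR3-style explicit regularizer for TD learning.
# It penalizes feature "co-adaptation": large dot products between the
# representations of (s, a) and the next pair (s', a') used in the backup.
import torch

def dr3_penalty(phi_sa: torch.Tensor, phi_next: torch.Tensor) -> torch.Tensor:
    """Mean dot product between features of (s, a) and (s', a')."""
    return (phi_sa * phi_next).sum(dim=-1).mean()

# Example usage inside a TD update (td_loss, features, and c0 are assumed):
# loss = td_loss + c0 * dr3_penalty(features(s, a), features(s_next, a_next))
```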