A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning
- URL: http://arxiv.org/abs/2310.18988v1
- Date: Sun, 29 Oct 2023 12:05:39 GMT
- Title: A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning
- Authors: Alicia Curth, Alan Jeffares, Mihaela van der Schaar
- Abstract summary: We show that the second descent appears exactly (and only) where a transition between implicit complexity axes occurs, and that its location is therefore not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
- Score: 68.76846801719095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional statistical wisdom established a well-understood relationship
between model complexity and prediction error, typically presented as a
U-shaped curve reflecting a transition between under- and overfitting regimes.
However, motivated by the success of overparametrized neural networks, recent
influential work has suggested this theory to be generally incomplete,
introducing an additional regime that exhibits a second descent in test error
as the parameter count p grows past sample size n - a phenomenon dubbed double
descent. While most attention has naturally been given to the deep-learning
setting, double descent was shown to emerge more generally across non-neural
models: known cases include linear regression, trees, and boosting. In this
work, we take a closer look at evidence surrounding these more classical
statistical machine learning methods and challenge the claim that observed
cases of double descent truly extend the limits of a traditional U-shaped
complexity-generalization curve therein. We show that once careful
consideration is given to what is being plotted on the x-axes of their double
descent plots, it becomes apparent that there are implicitly multiple
complexity axes along which the parameter count grows. We demonstrate that the
second descent appears exactly (and only) when and where the transition between
these underlying axes occurs, and that its location is thus not inherently tied
to the interpolation threshold p=n. We then gain further insight by adopting a
classical nonparametric statistics perspective. We interpret the investigated
methods as smoothers and propose a generalized measure for the effective number
of parameters they use on unseen examples, using which we find that their
apparent double descent curves indeed fold back into more traditional convex
shapes - providing a resolution to tensions between double descent and
statistical intuition.
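The smoother perspective in the abstract can be made concrete. Classical nonparametric statistics assigns a linear smoother, with predictions of the form y_hat(x0) = sum_i l_i(x0) * y_i, an effective parameter count of tr(S) on the training data, which for k-NN regression equals n/k. The sketch below is a minimal illustration of extending that count to unseen inputs using a simple variance-based proxy on toy k-NN regression; the proxy, data, and constants are assumptions for illustration, not the paper's exact generalized measure.

```python
# Minimal sketch (assumed setup, not the paper's exact estimator): read off
# the weights a k-NN smoother places on training points when predicting at
# *unseen* inputs, and use the variance-based proxy
#   p_eff ~ n_train * mean over test points of ||l(x0)||^2
# which for k-NN recovers the classical n/k degrees of freedom.
import numpy as np

rng = np.random.default_rng(0)

def knn_weights(x0, X_train, k):
    """Smoother weights l(x0): 1/k on the k nearest training points, 0 elsewhere."""
    dists = np.abs(X_train - x0)
    idx = np.argsort(dists)[:k]
    w = np.zeros(len(X_train))
    w[idx] = 1.0 / k
    return w

n_train, n_test = 200, 100
X_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(3 * X_train) + 0.3 * rng.standard_normal(n_train)
X_test = rng.uniform(-1, 1, n_test)
y_test = np.sin(3 * X_test)

for k in [1, 5, 20, 50]:
    # Rows are smoother weight vectors evaluated at unseen test inputs.
    W = np.array([knn_weights(x0, X_train, k) for x0 in X_test])
    preds = W @ y_train
    mse = np.mean((preds - y_test) ** 2)
    # Proxy for the effective number of parameters used on unseen examples.
    p_eff = n_train * np.mean(np.sum(W ** 2, axis=1))
    print(f"k={k:3d}  effective parameters ~ {p_eff:6.1f}  test MSE = {mse:.3f}")
```

Decreasing k drives the proxy toward n while eventually inflating test error, tracing the traditional convex complexity-generalization shape the paper recovers once complexity is measured in effective rather than raw parameters.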
Related papers
- Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures [93.17009514112702]
Pruning, setting a significant subset of the parameters of a neural network to zero, is one of the most popular methods of model compression.
Despite existing evidence that pruning can induce bias, the relationship between neural network pruning and induced bias is not well understood.
arXiv Detail & Related papers (2023-04-25T07:42:06Z) - Double Descent Demystified: Identifying, Interpreting & Ablating the
Sources of a Deep Learning Puzzle [12.00962791565144]
Double descent is a surprising phenomenon in machine learning.
As the number of model parameters grows relative to the number of data points, the test error first falls, then rises near the interpolation threshold, and then falls again (see the sketch after this list).
arXiv Detail & Related papers (2023-03-24T17:03:40Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can nonetheless generalize well to unseen test data, a phenomenon known as benign overfitting.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We derive an excess risk bound for the gradient descent solution of the least squares objective.
We find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z) - Overparameterization and generalization error: weighted trigonometric
interpolation [4.631723879329972]
We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples.
We show precisely how a bias towards smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error.
arXiv Detail & Related papers (2020-06-15T15:53:22Z) - Double Trouble in Double Descent : Bias and Variance(s) in the Lazy
Regime [32.65347128465841]
Deep neural networks can achieve remarkable performance while interpolating the training data perfectly.
Rather than following the U-curve of the bias-variance trade-off, their test error often follows a "double descent" curve.
We develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks.
arXiv Detail & Related papers (2020-03-02T17:39:31Z)
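The double descent pattern studied across several of the papers above is easy to reproduce in least squares. The sketch below (an assumed toy setup, not taken from any of the listed papers) fits random-feature regression of growing width p to n = 50 samples; np.linalg.lstsq returns the ordinary least squares fit for p <= n and the minimum-norm interpolant for p > n, so the fitting mechanism silently changes at p = n. That switch is exactly the kind of implicit transition between complexity axes that the main paper identifies as the source of the second descent.

```python
# Minimal sketch (assumed toy data and random ReLU features): test error of
# least squares with p random features, where the solver switches from OLS
# (p <= n) to the minimum-norm interpolant (p > n). Test error typically
# peaks near p = n and descends again beyond it.
import numpy as np

rng = np.random.default_rng(1)
n, n_test, d = 50, 500, 5

X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.5 * rng.standard_normal(n)
X_te = rng.standard_normal((n_test, d))
y_te = X_te @ beta

for p in [2, 10, 25, 45, 50, 55, 100, 400, 1000]:
    # Draw a fixed random-feature map per width p, shared by train and test.
    W = np.random.default_rng(p).standard_normal((d, p)) / np.sqrt(d)
    Phi = np.maximum(X @ W, 0.0)
    Phi_te = np.maximum(X_te @ W, 0.0)
    # lstsq gives the OLS solution when p <= n, the min-norm one when p > n.
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    mse = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"p={p:5d}  test MSE = {mse:8.3f}")
```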
This list is automatically generated from the titles and abstracts of the papers on this site.