Triple descent and the two kinds of overfitting: Where & why do they
appear?
- URL: http://arxiv.org/abs/2006.03509v2
- Date: Tue, 13 Oct 2020 09:05:01 GMT
- Title: Triple descent and the two kinds of overfitting: Where & why do they
appear?
- Authors: Stéphane d'Ascoli, Levent Sagun, Giulio Biroli
- Abstract summary: We show that despite their apparent similarity, both peaks can co-exist when neural networks are applied to noisy regression tasks.
The relative size of the peaks is then governed by the degree of nonlinearity of the activation function.
We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise.
- Score: 16.83019116094311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent line of research has highlighted the existence of a "double descent"
phenomenon in deep learning, whereby increasing the number of training examples
$N$ causes the generalization error of neural networks to peak when $N$ is of
the same order as the number of parameters $P$. In earlier works, a similar
phenomenon was shown to exist in simpler models such as linear regression,
where the peak instead occurs when $N$ is equal to the input dimension $D$.
Since both peaks coincide with the interpolation threshold, they are often
conflated in the literature. In this paper, we show that despite their
apparent similarity, these two scenarios are inherently different. In fact,
both peaks can co-exist when neural networks are applied to noisy regression
tasks. The relative size of the peaks is then governed by the degree of
nonlinearity of the activation function. Building on recent developments in the
analysis of random feature models, we provide a theoretical ground for this
sample-wise triple descent. As shown previously, the nonlinear peak at
$N\!=\!P$ is a true divergence caused by the extreme sensitivity of the output
function to both the noise corrupting the labels and the initialization of the
random features (or the weights in neural networks). This peak survives in the
absence of noise, but can be suppressed by regularization. In contrast, the
linear peak at $N\!=\!D$ is solely due to overfitting the noise in the labels,
and forms earlier during training. We show that this peak is implicitly
regularized by the nonlinearity, which is why it only becomes salient at high
noise and is weakly affected by explicit regularization. Throughout the paper,
we compare analytical results obtained in the random feature model with the
outcomes of numerical experiments involving deep neural networks.
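To make the random feature setting concrete, below is a minimal numerical sketch of the sample-wise descent curve. It is not the authors' exact experimental setup: the tanh activation, the linear teacher, the dimensions $D$ and $P$, the noise level, and the small ridge term are illustrative choices.
```python
# Minimal sketch of sample-wise descent in a random feature model (illustrative,
# not the paper's exact setup). Sweeping the number of samples N past N = D and
# N = P traces the two peaks discussed above: the N = D bump is driven by label
# noise, while the N = P peak survives even without it.
import numpy as np

rng = np.random.default_rng(0)
D, P = 100, 300            # input dimension and number of random features
noise_std = 1.0            # label noise (the linear peak at N = D needs noise > 0)
ridge = 1e-6               # small ridge for numerical stability (near ridgeless)

theta = rng.normal(size=(P, D)) / np.sqrt(D)   # fixed random projection (first layer)
w_teacher = rng.normal(size=D) / np.sqrt(D)    # linear teacher

def features(X):
    return np.tanh(X @ theta.T)                # nonlinear random features

def test_mse(N, n_test=2000):
    X = rng.normal(size=(N, D))
    y = X @ w_teacher + noise_std * rng.normal(size=N)
    Z = features(X)
    a = np.linalg.solve(Z.T @ Z + ridge * np.eye(P), Z.T @ y)   # (ridge) least squares
    X_test = rng.normal(size=(n_test, D))
    return np.mean((features(X_test) @ a - X_test @ w_teacher) ** 2)

for N in [25, 50, 100, 200, 300, 600, 1200]:   # crosses N = D = 100 and N = P = 300
    print(f"N = {N:4d}   test MSE = {np.mean([test_mse(N) for _ in range(5)]):.3f}")
```
In this sketch, raising the noise level makes the bump near $N=D$ more pronounced, while increasing the ridge term suppresses the peak near $N=P$, consistent with the description above.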
Related papers
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, random feature ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
quantile regression tends to under-cover relative to the desired coverage level in reality.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv Detail & Related papers (2021-06-10T06:11:55Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes [15.76663241036412]
We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z) - Structure Learning in Inverse Ising Problems Using $\ell_2$-Regularized
Linear Estimator [8.89493507314525]
We show that despite the model mismatch, one can perfectly identify the network structure using naive linear regression without regularization.
We propose a two-stage estimator: In the first stage, the ridge regression is used and the estimates are pruned by a relatively small threshold.
This estimator with the appropriate regularization coefficient and thresholds is shown to achieve the perfect identification of the network structure even in $0<M/N<1$.
arXiv Detail & Related papers (2020-08-19T09:11:33Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study phenomena in the context of two-layers neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian
Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the dimension of the feature space $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$.
arXiv Detail & Related papers (2020-06-09T02:05:40Z)
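For context on the last entry, RFF regression refers to ridge regression on the classical Rahimi-Recht random Fourier feature map for a Gaussian kernel. A minimal sketch follows; the sizes $n$, $p$, $N$, the bandwidth, and the ridge strength are illustrative and are not the article's exact setting.
```python
# Minimal sketch of random Fourier feature (RFF) ridge regression.
# Standard Rahimi-Recht cosine features for a Gaussian kernel; the sizes
# n, p, N, the bandwidth sigma, and the ridge strength lam are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 500, 20, 1000      # samples, input dimension, number of RFF features
sigma, lam = 1.0, 1e-2       # kernel bandwidth and ridge regularization

W = rng.normal(scale=1.0 / sigma, size=(N, p))   # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=N)        # random phases

def rff(X):
    # Feature map whose inner products approximate exp(-||x - x'||^2 / (2 sigma^2))
    return np.sqrt(2.0 / N) * np.cos(X @ W.T + b)

# Synthetic regression data (any smooth target would do for illustration)
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

Z = rff(X)
alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)   # ridge solution

X_test = rng.normal(size=(200, p))
y_test = np.sin(X_test[:, 0])
print("test MSE:", np.mean((rff(X_test) @ alpha - y_test) ** 2))
```
The article studies the regime where $n$, $p$, and $N$ grow large together; the sketch fixes them at small values purely to show the estimator itself.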