A Brief Prehistory of Double Descent
- URL: http://arxiv.org/abs/2004.04328v1
- Date: Tue, 7 Apr 2020 09:41:24 GMT
- Title: A Brief Prehistory of Double Descent
- Authors: Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, David M.J.
Tax
- Abstract summary: Belkin et al. illustrate and discuss the shape of risk curves in the context of modern high-complexity learners.
With increasing $N$, the risk initially decreases, attains a minimum, and then increases until $N$ equals $n$, where the training data is fitted perfectly.
- Score: 75.37825440319975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In their thought-provoking paper [1], Belkin et al. illustrate and discuss
the shape of risk curves in the context of modern high-complexity learners.
Given a fixed training sample size $n$, such curves show the risk of a learner
as a function of some (approximate) measure of its complexity $N$. With $N$ the
number of features, these curves are also referred to as feature curves. A
salient observation in [1] is that these curves can display, what they call,
double descent: with increasing $N$, the risk initially decreases, attains a
minimum, and then increases until $N$ equals $n$, where the training data is
fitted perfectly. Increasing $N$ even further, the risk decreases a second and
final time, creating a peak at $N=n$. This twofold descent may come as a
surprise, but as opposed to what [1] reports, it has not been overlooked
historically. Our letter draws attention to some original, earlier findings, of
interest to contemporary machine learning.
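To make the shape of such a feature curve concrete, here is a minimal Python sketch, not taken from [1] or from the earlier works the letter points to: it fits minimum-norm least squares on the first $N$ of $D$ Gaussian features at a fixed sample size $n$ and prints the test risk as $N$ grows. The sample size, total feature pool, noise level, and coefficient decay are illustrative assumptions; with them, the risk typically traces the curve described above, with a pronounced peak at $N = n$.

```python
# Minimal illustrative sketch (not from [1] or the works cited in the letter):
# a "feature curve" for minimum-norm least squares at a fixed sample size n.
# All constants (n, D, noise level, coefficient decay) are assumptions chosen
# only to make the double-descent shape visible.
import numpy as np

n, n_test, D, sigma, trials = 40, 1000, 200, 0.5, 25
Ns = (2, 5, 10, 20, 30, 38, 40, 42, 60, 100, 200)
rng = np.random.default_rng(0)

# True coefficients decay, so the early features carry most of the signal.
beta = 1.0 / np.arange(1, D + 1)

risk = np.zeros(len(Ns))
for _ in range(trials):
    X_tr = rng.normal(size=(n, D))
    X_te = rng.normal(size=(n_test, D))
    y_tr = X_tr @ beta + sigma * rng.normal(size=n)
    y_te = X_te @ beta + sigma * rng.normal(size=n_test)
    for i, N in enumerate(Ns):
        # pinv gives ordinary least squares for N < n and the minimum-norm
        # interpolating solution for N >= n.
        w = np.linalg.pinv(X_tr[:, :N]) @ y_tr
        risk[i] += np.mean((X_te[:, :N] @ w - y_te) ** 2) / trials

for N, r in zip(Ns, risk):
    print(f"N = {N:3d}   average test risk = {r:8.3f}")
# Typical output: the risk first decreases, rises to a sharp peak at N = n = 40,
# and then descends a second time as N grows past n.
```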
Related papers
- Learning to Price Homogeneous Data [6.288169915425957]
We develop novel discretization schemes to approximate any pricing curve.
Our online algorithms build on classical algorithms such as UCB and FTPL.
Using the improved discretization schemes, we are able to achieve $\tilde{O}(m\sqrt{T})$ regret in the stochastic setting and $\tilde{O}(m^{3/2}\sqrt{T})$ regret in the adversarial setting.
arXiv Detail & Related papers (2024-07-07T20:02:52Z) - Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-Norm Linear Regression [19.31269916674961]
We show that, in the realizable case, under no moment assumptions, $O(d)$ samples are enough to exactly recover the target.
We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.
arXiv Detail & Related papers (2023-10-19T03:21:28Z) - Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
arXiv Detail & Related papers (2022-08-26T16:21:31Z) - Overparametrized linear dimensionality reductions: From projection
pursuit to two-layer neural networks [10.368585938419619]
Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$.
What does this collection of probability distributions look like when $n,d$ grow large?
Denoting by $\mathscr{F}_{m,\alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m,\alpha}$.
arXiv Detail & Related papers (2022-06-14T00:07:33Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Sharp Statistical Guarantees for Adversarially Robust Gaussian
Classification [54.22421582955454]
We provide the first result of the optimal minimax guarantees for the excess risk for adversarially robust classification.
Results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting.
arXiv Detail & Related papers (2020-06-29T21:06:52Z) - Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff
in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels.
We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ).
We prove that RSVI attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^3 S^2 A T}\big)$ regret, while RSQ attains an $\tilde{O}\big(\lambda(\ldots)\big)$ regret.
arXiv Detail & Related papers (2020-06-22T19:28:26Z) - Triple descent and the two kinds of overfitting: Where & why do they
appear? [16.83019116094311]
We show that despite their apparent similarity, both peaks can co-exist when neural networks are applied to noisy regression tasks.
The relative size of the peaks is then governed by the degree of nonlinearity of the activation function.
We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise.
arXiv Detail & Related papers (2020-06-05T15:24:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.