On the Role of Optimization in Double Descent: A Least Squares Study
- URL: http://arxiv.org/abs/2107.12685v1
- Date: Tue, 27 Jul 2021 09:13:11 GMT
- Title: On the Role of Optimization in Double Descent: A Least Squares Study
- Authors: Ilja Kuzborskij, Csaba Szepesv\'ari, Omar Rivasplata, Amal
Rannen-Triki, Razvan Pascanu
- Abstract summary: We show an excess risk bound for the descent gradient solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
- Score: 30.44215064390409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empirically it has been observed that the performance of deep neural networks
steadily improves as we increase model size, contradicting the classical view
on overfitting and generalization. Recently, the double descent phenomena has
been proposed to reconcile this observation with theory, suggesting that the
test error has a second descent when the model becomes sufficiently
overparameterized, as the model size itself acts as an implicit regularizer. In
this paper we add to the growing body of work in this space, providing a
careful study of learning dynamics as a function of model size for the least
squares scenario. We show an excess risk bound for the gradient descent
solution of the least squares objective. The bound depends on the smallest
non-zero eigenvalue of the covariance matrix of the input features, via a
functional form that has the double descent behavior. This gives a new
perspective on the double descent curves reported in the literature. Our
analysis of the excess risk allows to decouple the effect of optimization and
generalization error. In particular, we find that in case of noiseless
regression, double descent is explained solely by optimization-related
quantities, which was missed in studies focusing on the Moore-Penrose
pseudoinverse solution. We believe that our derivation provides an alternative
view compared to existing work, shedding some light on a possible cause of this
phenomena, at least in the considered least squares setting. We empirically
explore if our predictions hold for neural networks, in particular whether the
covariance of intermediary hidden activations has a similar behavior as the one
predicted by our derivations.
Related papers
- Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, section 2 introduces inductive biases that appear to have a key role in double descent by selecting.
section 3 explores the double descent with two linear models, and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z) - The Surprising Harmfulness of Benign Overfitting for Adversarial
Robustness [13.120373493503772]
We prove a surprising result that even if the ground truth itself is robust to adversarial examples, the benignly overfitted model is benign in terms of the standard'' out-of-sample risk objective.
Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adverasrial attack, while beginly overfitted neural networks lead to models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - Understanding the Role of Optimization in Double Descent [8.010193718024347]
We propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all.
To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent are unified from the viewpoint of optimization.
Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups.
arXiv Detail & Related papers (2023-12-06T23:29:00Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show that double descent appears exactly when and where it occurs, and that its location is not inherently tied to the threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Origins of Low-dimensional Adversarial Perturbations [17.17170592140042]
We study the phenomenon of low-dimensional adversarial perturbations in classification.
The goal is to fool the classifier into flipping its decision on a nonzero fraction of inputs from a designated class.
We compute lowerbounds for the fooling rate of any subspace.
arXiv Detail & Related papers (2022-03-25T17:02:49Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Nonasymptotic theory for two-layer neural networks: Beyond the
bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.