Understanding the Role of Optimization in Double Descent
- URL: http://arxiv.org/abs/2312.03951v1
- Date: Wed, 6 Dec 2023 23:29:00 GMT
- Title: Understanding the Role of Optimization in Double Descent
- Authors: Chris Yuhao Liu, Jeffrey Flanigan
- Abstract summary: We propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all.
To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent are unified from the viewpoint of optimization.
Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups.
- Score: 8.010193718024347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The phenomenon of model-wise double descent, where the test error peaks and
then reduces as the model size increases, is an interesting topic that has
attracted the attention of researchers due to the striking observed gap between
theory and practice (Belkin et al., 2018). Additionally, while double
descent has been observed in various tasks and architectures, the peak of
double descent can sometimes be noticeably absent or diminished, even without
explicit regularization, such as weight decay and early stopping. In this
paper, we investigate this intriguing phenomenon from the optimization
perspective and propose a simple optimization-based explanation for why double
descent sometimes occurs weakly or not at all. To the best of our knowledge, we
are the first to demonstrate that many disparate factors contributing to
model-wise double descent (initialization, normalization, batch size, learning
rate, optimization algorithm) are unified from the viewpoint of optimization:
model-wise double descent is observed if and only if the optimizer can find a
sufficiently low-loss minimum. These factors directly affect the condition
number of the optimization problem or the optimizer and thus affect the final
minimum found by the optimizer, reducing or increasing the height of the double
descent peak. We conduct a series of controlled experiments on random feature
models and two-layer neural networks under various optimization settings,
demonstrating this optimization-based unified view. Our results suggest the
following implication: Double descent is unlikely to be a problem for
real-world machine learning setups. Additionally, our results help explain the
gap between weak double descent peaks in practice and strong peaks observable
in carefully designed setups.
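As a rough illustration of the kind of controlled random-feature experiment described above, the sketch below sweeps the number of random ReLU features and fits the readout with a minimum-norm least-squares solution. The data sizes, noise level, and feature map are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 100, 1000, 20, 0.1

# Linear teacher with a little label noise (illustrative choices).
w_star = rng.normal(size=d) / np.sqrt(d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + noise * rng.normal(size=n_train)
y_te = X_te @ w_star

def test_error(n_features):
    """Random-ReLU-feature model fit with the (minimum-norm) least-squares solution."""
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)  # fixed random first layer
    F_tr = np.maximum(X_tr @ W, 0.0)
    F_te = np.maximum(X_te @ W, 0.0)
    beta = np.linalg.pinv(F_tr) @ y_tr                 # min-norm interpolator once n_features >= n_train
    return np.mean((F_te @ beta - y_te) ** 2)

for m in [10, 50, 90, 100, 110, 200, 500, 2000]:
    print(f"features={m:5d}  test MSE={test_error(m):.3f}")
# The test MSE typically peaks near m ~ n_train (the interpolation threshold)
# and then descends again as the model grows.
```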
Related papers
- The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers [8.770864706004472]
We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE).
This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect.
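One simple way to check for this pattern in one's own training runs is to average the per-batch loss by its position within the epoch. The sketch below is a diagnostic helper, not code from the paper, and the synthetic losses are fabricated purely to exercise it.

```python
import numpy as np

def epoch_position_profile(batch_losses, batches_per_epoch):
    """Mean loss at each within-epoch position, averaged over complete epochs."""
    losses = np.asarray(batch_losses, dtype=float)
    n_full = (len(losses) // batches_per_epoch) * batches_per_epoch
    return losses[:n_full].reshape(-1, batches_per_epoch).mean(axis=0)

# Fabricated per-batch losses with a decaying trend plus a sawtooth term,
# only to show what the diagnostic would reveal on a real training log.
rng = np.random.default_rng(0)
steps = np.arange(2000)
fake_losses = np.exp(-steps / 1000) + 0.2 * (steps % 100) / 100 + 0.02 * rng.normal(size=2000)

profile = epoch_position_profile(fake_losses, batches_per_epoch=100)
print("mean loss, first 5 batches of an epoch:", profile[:5].round(3))
print("mean loss, last 5 batches of an epoch: ", profile[-5:].round(3))
```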
arXiv Detail & Related papers (2024-10-14T00:51:21Z)
- Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smoothly interpolating one.
Section 3 explores double descent with two linear models, and gives further perspectives from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
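One concrete instance of linearly interpolating iterates is the Lookahead-style update sketched below. The quadratic objective, step sizes, and interpolation weight are illustrative assumptions; the paper's own algorithm and analysis may differ.

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * w^T A w for an ill-conditioned quadratic,
    # standing in for a training problem where plain gradient steps oscillate.
    A = np.diag([100.0, 1.0])
    return A @ w

def lookahead_gd(w0, lr=0.015, k=5, alpha=0.5, outer_steps=150):
    slow = np.array(w0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                 # k fast inner gradient steps
            fast -= lr * grad(fast)
        slow += alpha * (fast - slow)      # linear interpolation back toward the slow weights
    return slow

print(lookahead_gd([1.0, 1.0]))  # moves toward the minimizer at the origin
```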
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Hybrid Predictive Coding: Inferring, Fast and Slow [62.997667081978825]
We propose a hybrid predictive coding network that combines both iterative and amortized inference in a principled manner.
We demonstrate that our model is inherently sensitive to its uncertainty and adaptively balances iterative and amortized inference to obtain accurate beliefs using minimum computational expense.
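A toy sketch of the hybrid idea, under simplifying assumptions of my own (a linear generative model and a hand-built encoder rather than the paper's networks): an amortized encoder provides a fast initial latent estimate, and a few iterative prediction-error-minimization steps refine it.

```python
import numpy as np

rng = np.random.default_rng(2)
latent_dim, obs_dim = 4, 16
G = rng.normal(size=(obs_dim, latent_dim))       # linear generative (top-down) weights
# Deliberately imperfect amortized encoder: a noisy pseudo-inverse of G.
A = np.linalg.pinv(G) + 0.1 * rng.normal(size=(latent_dim, obs_dim))

z_true = rng.normal(size=latent_dim)
x = G @ z_true                                   # observation generated from the latent

z = A @ x                                        # fast amortized guess
print("latent error after amortized step:", np.linalg.norm(z - z_true))

lr = 0.01
for _ in range(200):                             # slow iterative refinement
    pred_err = x - G @ z                         # prediction error on the observation
    z += lr * G.T @ pred_err                     # gradient step on 0.5*||x - G z||^2
print("latent error after iterative steps:", np.linalg.norm(z - z_true))
```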
arXiv Detail & Related papers (2022-04-05T12:52:45Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that, in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
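A minimal sketch of the setting (assumptions mine, not the paper's exact construction): gradient descent from zero initialization on an overparameterized, noiseless least-squares problem moves toward the minimum-norm interpolator, so how close it gets is governed by optimization quantities such as the step size and number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 100                                   # overparameterized: d > n
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star                                   # noiseless targets

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-norm interpolator

w = np.zeros(d)                                  # zero initialization
lr = 1.0 / np.linalg.norm(X, 2) ** 2             # step size from the top singular value
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)                  # full-batch gradient step on 0.5*||Xw - y||^2

print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
print("training residual:            ", np.linalg.norm(X @ w - y))
```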
arXiv Detail & Related papers (2021-07-27T09:13:11Z)
- Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Double Descent Optimization Pattern and Aliasing: Caveats of Noisy Labels [1.4424394176890545]
This work confirms that double descent occurs with small datasets and noisy labels.
We show that increasing the learning rate can create an aliasing effect that masks the double descent pattern without suppressing it.
We show that these findings translate to a real-world application: forecasting events in epileptic patients from continuous electroencephalographic recordings.
arXiv Detail & Related papers (2021-06-03T19:41:40Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
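For context, the sketch below sets up a simple bilinear objective (rank-1 matrix factorization) and updates both coupled variables with synchronous gradient steps. It illustrates the coupling only and is not the CoGD projection scheme; all sizes and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 15
u_true, v_true = rng.normal(size=m), rng.normal(size=n)
Y = np.outer(u_true, v_true)                     # rank-1 target

u, v = rng.normal(size=m), rng.normal(size=n)    # coupled variables
lr = 1e-3
for _ in range(5000):
    R = np.outer(u, v) - Y                       # residual couples u and v
    grad_u = R @ v                               # d/du of 0.5*||u v^T - Y||_F^2
    grad_v = R.T @ u                             # d/dv of 0.5*||u v^T - Y||_F^2
    u, v = u - lr * grad_u, v - lr * grad_v      # synchronous update of both variables

print("reconstruction error:", np.linalg.norm(np.outer(u, v) - Y))
```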
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.