Can we avoid Double Descent in Deep Neural Networks?
- URL: http://arxiv.org/abs/2302.13259v4
- Date: Tue, 4 Jul 2023 12:14:44 GMT
- Title: Can we avoid Double Descent in Deep Neural Networks?
- Authors: Victor Quétu and Enzo Tartaglione
- Abstract summary: Double descent has caught the attention of the deep learning community.
It raises serious questions about the optimal model size for maintaining high generalization.
Our work shows that the double descent phenomenon is potentially avoidable with proper conditioning of the learning problem.
- Score: 3.1473798197405944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding the optimal size of deep learning models is highly relevant and of broad
impact, especially in energy-saving schemes. Very recently, an unexpected
phenomenon, "double descent", has caught the attention of the deep
learning community. As the model's size grows, the performance first gets
worse and then improves again. This raises serious questions about the
optimal model size for maintaining high generalization: the model needs to be
sufficiently over-parametrized, but adding too many parameters wastes training
resources. Is it possible to find the best trade-off efficiently? Our
work shows that the double descent phenomenon is potentially avoidable with
proper conditioning of the learning problem, but a final answer is yet to be
found. We empirically observe that there is hope of dodging double descent in
complex scenarios with proper regularization, as a simple $\ell_2$
regularization already contributes positively to this end.
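As a quick illustration of the abstract's closing claim, the sketch below (our construction, not code from the paper) sweeps model width while adding a plain $\ell_2$ penalty through the optimizer's weight_decay, the simplest way to condition the problem and probe whether the double descent peak flattens. The data, architecture, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not from the paper): probe double descent over model width,
# with and without a plain L2 penalty (weight_decay) conditioning the problem.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test = 20, 128, 512
w_true = torch.randn(d, 1)

def make_split(n):
    X = torch.randn(n, d)
    y = X @ w_true + 0.5 * torch.randn(n, 1)   # label noise drives the peak
    return X, y

X_tr, y_tr = make_split(n_train)
X_te, y_te = make_split(n_test)

def train_mlp(width, weight_decay):
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.mse_loss(model(X_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(X_te), y_te).item()

for width in [2, 8, 32, 128, 512]:             # under- to over-parametrized
    for wd in [0.0, 1e-3]:                     # weight_decay = L2 strength
        print(width, wd, train_mlp(width, wd))
```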
Related papers
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as a metric alongside the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
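For the MoE entry above, a hypothetical sketch of the amended scaling-law objective: validation loss plus an inference-cost term. The cost model, the trade-off weight alpha, and the example numbers are our assumptions, not the paper's.

```python
# Hypothetical sketch: score MoE configurations by validation loss plus an
# inference-cost term, instead of validation loss alone. The cost model and
# the trade-off weight `alpha` are illustrative assumptions.
def inference_cost(active_params: float, total_params: float) -> float:
    # Serving cost grows with parameters activated per token, plus a smaller
    # memory term for hosting all experts.
    return active_params + 0.1 * total_params

def amended_score(val_loss: float, active_params: float,
                  total_params: float, alpha: float = 1e-12) -> float:
    return val_loss + alpha * inference_cost(active_params, total_params)

# Example: a sparse MoE vs. a dense model with the same validation loss.
print(amended_score(2.05, active_params=3e9, total_params=12e9))
print(amended_score(2.05, active_params=7e9, total_params=7e9))
```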
- Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets up the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smoothly interpolating one.
Section 3 explores double descent with two linear models, and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z)
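In the spirit of the tutorial's linear-model section, a minimal self-contained demonstration (ours, not the tutorial's code): the minimum-norm least-squares fit exhibits the classic test-error peak near the interpolation threshold p ≈ n.

```python
# Minimal sketch: test error of the minimum-norm least-squares fit peaks near
# the interpolation threshold p ≈ n, the classic double descent picture.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 120                                # train size, ambient dimension
w = rng.normal(size=d) / np.sqrt(d)
X_tr = rng.normal(size=(n, d)); y_tr = X_tr @ w + 0.3 * rng.normal(size=n)
X_te = rng.normal(size=(400, d)); y_te = X_te @ w

for p in [10, 20, 35, 40, 45, 80, 120]:       # number of features used
    w_hat = np.linalg.pinv(X_tr[:, :p]) @ y_tr    # minimum-norm solution
    err = np.mean((X_te[:, :p] @ w_hat - y_te) ** 2)
    print(f"p={p:3d}  test MSE={err:.3f}")        # peak expected near p=n=40
```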
- The Quest of Finding the Antidote to Sparse Double Descent [1.336445018915526]
As the model's sparsity increases, the performance first worsens, then improves, and finally deteriorates.
Such non-monotonic behavior raises serious questions about the optimal model size for maintaining high performance.
We show that a simple $\ell_2$ regularization method can help to mitigate this phenomenon, but at the cost of the performance/sparsity trade-off.
arXiv Detail & Related papers (2023-08-31T09:56:40Z)
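For the entry above, a minimal sketch (our construction) of the kind of experiment described: magnitude-prune a trained network at increasing sparsity, with and without the simple $\ell_2$ penalty; the retraining after pruning used in real sparse double descent studies is omitted for brevity.

```python
# Sketch: magnitude-prune a trained toy network at increasing sparsity and
# measure test loss, with and without a simple L2 penalty (weight decay).
# Retraining between pruning levels is omitted for brevity.
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
X = torch.randn(256, 10); y = X.sum(dim=1, keepdim=True)
X_te = torch.randn(256, 10); y_te = X_te.sum(dim=1, keepdim=True)

def train(weight_decay):
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), weight_decay=weight_decay)
    for _ in range(500):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
    return model

for wd in [0.0, 1e-4]:                        # without / with the L2 penalty
    base = train(wd)
    for s in [0.5, 0.8, 0.9, 0.95, 0.99]:     # increasing sparsity
        model = copy.deepcopy(base)
        for m in model:
            if isinstance(m, nn.Linear):
                prune.l1_unstructured(m, name="weight", amount=s)
        with torch.no_grad():
            print(wd, s, nn.functional.mse_loss(model(X_te), y_te).item())
```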
- Learning to Jump: Thinning and Thickening Latent Counts for Generative Modeling [69.60713300418467]
Learning to jump is a general recipe for generative modeling of various types of data.
We demonstrate when learning to jump is expected to perform comparably to learning to denoise, and when it is expected to perform better.
arXiv Detail & Related papers (2023-05-28T05:38:28Z)
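Based on our reading of the abstract above, the forward corruption for count data is binomial thinning, the count analogue of Gaussian noising in diffusion models; the reverse "thickening" is what a learning-to-jump model would be trained to invert. The survival schedule below is an assumption.

```python
# Sketch (our reading of the abstract): corrupt count data by binomial
# thinning; counts can only jump downward as the survival probability drops.
# The reverse direction, "thickening", is what a model would learn.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.poisson(lam=20.0, size=8)            # clean latent counts
keep = [1.0, 0.8, 0.5, 0.2, 0.05]             # assumed survival schedule
x = x0.copy()
for t, p in enumerate(keep[1:], start=1):
    # each unit survives independently with probability keep[t]/keep[t-1]
    x = rng.binomial(x, p / keep[t - 1])
    print(t, x)
```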
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
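A toy simulation (our assumptions, not the paper's exact setting) of the quantities named above: tasks share a common ground truth plus a task-specific fluctuation, and we test the overfitted (minimum-norm) linear fit of the pooled task data under Gaussian features.

```python
# Toy simulation (our assumptions): tasks share a common ground truth plus a
# task-specific fluctuation; fit the overfitted (minimum-norm) linear
# regression on pooled task data and measure error against the common part.
import numpy as np

rng = np.random.default_rng(0)
d, n_per_task, T = 200, 10, 8                 # over-parametrized: d >> T*n
w0 = rng.normal(size=d) / np.sqrt(d)

def run(task_diversity, noise):
    Xs, ys = [], []
    for _ in range(T):
        wt = w0 + task_diversity * rng.normal(size=d) / np.sqrt(d)
        X = rng.normal(size=(n_per_task, d))
        Xs.append(X); ys.append(X @ wt + noise * rng.normal(size=n_per_task))
    X = np.vstack(Xs); y = np.concatenate(ys)
    w_hat = np.linalg.pinv(X) @ y             # interpolates the pooled data
    X_new = rng.normal(size=(500, d))
    return np.mean((X_new @ w_hat - X_new @ w0) ** 2)

for div, nz in [(0.0, 0.1), (1.0, 0.1), (1.0, 1.0)]:
    print(div, nz, run(div, nz))
```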
- DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free? [7.793339267280654]
First, we propose a learning framework that avoids the sparse double descent phenomenon and improves generalization.
Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon.
Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise.
arXiv Detail & Related papers (2023-03-02T12:54:12Z)
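The abstract above mentions an entropy measure; as a hypothetical proxy only (the paper's actual measure may differ), one can track the Shannon entropy of the normalized weight magnitudes as sparsity grows.

```python
# Hypothetical proxy (the paper's actual entropy measure may differ): Shannon
# entropy of a layer's normalized weight magnitudes, one scalar to track
# while sparsity increases.
import numpy as np

def weight_entropy(w: np.ndarray, eps: float = 1e-12) -> float:
    p = np.abs(w).ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

w = np.random.default_rng(0).normal(size=(64, 64))
print(weight_entropy(w))            # dense layer: higher entropy
w[np.abs(w) < 1.0] = 0.0            # crude magnitude pruning
print(weight_entropy(w))            # sparse layer: lower entropy
```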
- Sparse Double Descent: Where Network Pruning Aggravates Overfitting [8.425040193238777]
We report an unexpected sparse double descent phenomenon: as we increase model sparsity via network pruning, test performance first gets worse, then improves, and finally deteriorates again.
We propose a novel learning-distance interpretation: the curve of the $\ell_2$ learning distance of sparse models may correlate well with the sparse double descent curve.
arXiv Detail & Related papers (2022-06-17T11:02:15Z)
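A minimal sketch of the quantity described above: the $\ell_2$ learning distance between a model's final and initial parameters. The toy perturbation stands in for actual training.

```python
# Sketch: the L2 "learning distance" between final and initial parameters,
# the quantity the entry above correlates with the sparse double descent curve.
import torch
import torch.nn as nn

def l2_learning_distance(model_init: nn.Module, model_final: nn.Module) -> float:
    sq = 0.0
    for p0, p1 in zip(model_init.parameters(), model_final.parameters()):
        sq += (p1.detach() - p0.detach()).pow(2).sum().item()
    return sq ** 0.5

net0 = nn.Linear(10, 10)
net1 = nn.Linear(10, 10)
net1.load_state_dict(net0.state_dict())
with torch.no_grad():
    net1.weight.add_(0.1 * torch.randn_like(net1.weight))  # stand-in for training
print(l2_learning_distance(net0, net1))
```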
- The Low-Rank Simplicity Bias in Deep Networks [46.79964271742486]
We make a series of empirical observations that investigate and extend the hypothesis that deep networks are inductively biased to find solutions with lower effective-rank embeddings.
We show that our claim holds for finite-width linear and non-linear models on practical learning paradigms, and that on natural data these are often the solutions that generalize well.
arXiv Detail & Related papers (2021-03-18T17:58:02Z)
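One common notion of effective rank, the exponential of the entropy of the normalized singular values, makes the hypothesis above concrete; the paper may use a related but different definition.

```python
# Sketch: one standard "effective rank" (exponential of the entropy of
# normalized singular values); the paper may define it differently.
import numpy as np

def effective_rank(E: np.ndarray, eps: float = 1e-12) -> float:
    s = np.linalg.svd(E, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

rng = np.random.default_rng(0)
full = rng.normal(size=(128, 64))
low = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 64))   # rank-4 embeddings
print(effective_rank(full))   # markedly larger
print(effective_rank(low))    # near the true rank of 4
```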
- Do Wider Neural Networks Really Help Adversarial Robustness? [92.8311752980399]
We show that model robustness is closely related to the trade-off between natural accuracy and perturbation stability.
We propose a new Width Adjusted Regularization (WAR) method that adaptively enlarges $\lambda$ on wide models.
arXiv Detail & Related papers (2020-10-03T04:46:17Z)
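A hypothetical sketch of the idea named above: enlarge the regularization weight $\lambda$ with network width. The square-root schedule and the base value are our assumptions, not the paper's rule.

```python
# Illustrative sketch (the paper's exact adjustment may differ): enlarge the
# robust-regularization weight lambda with network width, so wider models
# trade a bit of natural accuracy for perturbation stability.
def width_adjusted_lambda(base_lambda: float, width: int,
                          base_width: int = 10) -> float:
    # Monotone increase with width; the sqrt schedule is our assumption.
    return base_lambda * (width / base_width) ** 0.5

for w in [1, 5, 10, 20]:          # e.g. WideResNet widen factors
    print(w, width_adjusted_lambda(6.0, w))
```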
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
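The practical takeaway of the entry above is early stopping on robust validation accuracy; a skeleton follows, where train_one_epoch and robust_accuracy are labeled placeholders for adversarial training and a PGD-style evaluation on a torch-style model.

```python
# Skeleton (placeholders, not the paper's code): keep the checkpoint with the
# best robust validation accuracy, since robust accuracy degrades with
# continued adversarial training. `train_one_epoch` and `robust_accuracy`
# are hypothetical callables; `model` is assumed to be a torch-style module.
def early_stopped_robust_training(model, epochs, train_one_epoch, robust_accuracy):
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        train_one_epoch(model)            # one epoch of adversarial training
        acc = robust_accuracy(model)      # e.g. PGD attack on a held-out set
        if acc > best_acc:
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)     # restore the best checkpoint
    return best_acc
```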
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.