DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural
Network Worry-Free?
- URL: http://arxiv.org/abs/2303.01213v3
- Date: Thu, 8 Feb 2024 08:26:47 GMT
- Title: DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural
Network Worry-Free?
- Authors: Victor Quétu, Enzo Tartaglione
- Abstract summary: We propose a learning framework that avoids such a phenomenon and improves generalization.
Second, we introduce an entropy measure that provides more insight into the emergence of this phenomenon.
Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise.
- Score: 7.793339267280654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neoteric works have shown that modern deep learning models can exhibit a
sparse double descent phenomenon. Indeed, as the sparsity of the model
increases, the test performance first worsens since the model is overfitting
the training data; then, the overfitting reduces, leading to an improvement in
performance, and finally, the model begins to forget critical information,
resulting in underfitting. Such behavior prevents the use of traditional
early-stopping criteria. In this work, we make three key contributions. First,
we propose a learning framework that avoids this phenomenon and improves
generalization. Second, we introduce an entropy measure that provides more
insight into the emergence of the phenomenon and enables the use of traditional
stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent
factors such as re-initialization methods, model width and depth, and dataset
noise. The contributions are supported by empirical evidence in typical setups.
Our code is available at https://github.com/VGCQ/DSD2.
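To make the described curve concrete, here is a minimal sketch that traces test accuracy across sparsity levels with one-shot global magnitude pruning. It is an illustration of how a sparse double descent curve is typically measured, not the authors' DSD$^2$ framework (see the repository above for that); `train` and `evaluate` are hypothetical stand-ins for a full training and evaluation loop.

```python
# Sketch: trace test accuracy vs. sparsity with one-shot magnitude pruning.
# Illustrative only, not the authors' DSD^2 framework; `train` and `evaluate`
# are hypothetical callables supplied by the user.
import copy
import torch
import torch.nn.utils.prune as prune

def sweep_sparsity(model, train, evaluate,
                   sparsities=(0.0, 0.5, 0.8, 0.9, 0.95, 0.99)):
    curve = []
    for s in sparsities:
        m = copy.deepcopy(model)
        # Collect all conv/linear weights for global unstructured pruning.
        params = [(mod, "weight") for mod in m.modules()
                  if isinstance(mod, (torch.nn.Conv2d, torch.nn.Linear))]
        if s > 0:
            prune.global_unstructured(params,
                                      pruning_method=prune.L1Unstructured,
                                      amount=s)
        train(m)                        # fine-tune the pruned model
        curve.append((s, evaluate(m)))  # test accuracy at this sparsity
    return curve  # plotting accuracy vs. sparsity can expose double descent
```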
Related papers
- Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt [37.98336090671441]
Concept Drift Detection and Adaptation (D3A) is proposed.
It first detects drifting concepts and then aggressively adapts the current model to them for rapid adaptation.
It helps mitigate the data distribution gap, a critical factor contributing to train-test performance inconsistency.
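As a rough illustration of the detect-then-adapt pattern, a generic sketch (not D3A's actual drift test or adaptation strategy; `model.predict` and `fit` are hypothetical stand-ins):

```python
# Generic detect-then-adapt loop for a forecasting stream. Illustrative only;
# D3A's actual detector and adaptation rule differ.
import numpy as np

def detect_then_adapt(stream, model, fit, window=200, threshold=3.0):
    recent = []      # sliding window of (x, y, error) triples
    baseline = None  # (mean, std) of errors under the current concept
    for x, y in stream:
        err = abs(model.predict(x) - y)
        recent.append((x, y, err))
        recent = recent[-window:]
        errs = np.array([e for _, _, e in recent])
        if baseline is None:
            if len(recent) == window:
                baseline = (errs.mean(), errs.std())
        elif errs.mean() > baseline[0] + threshold * baseline[1]:
            # Drift detected: adapt on the recent window, then
            # re-estimate the error baseline from scratch.
            fit(model, [(x_, y_) for x_, y_, _ in recent])
            recent, baseline = [], None
```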
arXiv Detail & Related papers (2024-03-22T04:44:43Z)
- Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets up the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent.
Section 3 explores double descent with two linear models and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
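A toy version of the analyzed setting, a random feature model fit by gradient descent on a fixed, repeatedly reused dataset, can show the train/test gap building up over training time (all dimensions and the learning rate below are arbitrary illustrative choices, not the paper's setup):

```python
# Toy random feature model trained by full-batch gradient descent on a fixed,
# reused dataset; the train/test gap grows over training time.
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 20, 500, 100                      # input dim, features, samples
W = rng.normal(size=(p, d)) / np.sqrt(d)    # frozen random projection
feats = lambda X: np.tanh(X @ W.T)
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    return X, X @ w_true + 0.5 * rng.normal(size=m)

(Xtr, ytr), (Xte, yte) = sample(n), sample(1000)
Ztr, Zte = feats(Xtr), feats(Xte)
a = np.zeros(p)                             # trainable readout weights
for step in range(2001):
    a -= 0.01 * Ztr.T @ (Ztr @ a - ytr) / n  # gradient of the MSE loss
    if step % 500 == 0:
        print(step,
              np.mean((Ztr @ a - ytr) ** 2),  # train loss keeps shrinking
              np.mean((Zte @ a - yte) ** 2))  # test loss stalls: gap widens
```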
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
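A rough sketch of tracking a forgetting rate on validation data between checkpoints (the paper's exact score may be defined differently):

```python
# Sketch: count validation examples that flip from correct to wrong between
# consecutive checkpoints -- a simple forgetting-rate signal for overfit.
# The paper's actual score may differ.
import numpy as np

def forgetting_rate(prev_correct, curr_correct):
    """Fraction of validation examples correct before but wrong now."""
    prev = np.asarray(prev_correct, dtype=bool)
    curr = np.asarray(curr_correct, dtype=bool)
    return (prev & ~curr).mean()

# Usage: after each epoch, store a boolean per-example correctness vector on
# the validation set and compare it against the previous epoch's vector.
```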
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
- LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder (VAE) based time series anomaly detection methods.
This work aims to make three novel contributions: 1) the retraining process is formulated as a convex problem that converges at a fast rate and prevents overfitting; 2) a ruminate block is designed that leverages historical data without the need to store it; and 3) it is proven mathematically that, when fine-tuning the latent vector and reconstructed data, linear formations achieve the least adjusting errors between the ground truths and the fine-tuned ones.
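The third point, a linear correction minimizing adjusting error in closed form, boils down to ordinary least squares; a toy sketch under that assumption (not LARA's actual formulation, which operates on the VAE's latent vectors and reconstructions):

```python
# Toy version of contribution 3: fit a linear map that adjusts reconstructed
# outputs toward the ground truth with least squared error. Closed form via
# ordinary least squares; LARA's actual formulation differs.
import numpy as np

def linear_adjustment(reconstructed, ground_truth):
    """Return (A, b) minimizing ||reconstructed @ A + b - ground_truth||^2."""
    X = np.hstack([reconstructed, np.ones((len(reconstructed), 1))])  # bias
    coef, *_ = np.linalg.lstsq(X, ground_truth, rcond=None)
    return coef[:-1], coef[-1]  # weight matrix A and bias b
```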
arXiv Detail & Related papers (2023-10-09T12:36:16Z)
- Sparse Double Descent: Where Network Pruning Aggravates Overfitting [8.425040193238777]
We report an unexpected sparse double descent phenomenon: as we increase model sparsity via network pruning, test performance first gets worse.
We propose a novel learning distance interpretation: the curve of the $\ell_2$ learning distance of sparse models may correlate well with the sparse double descent curve.
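A rough sketch of such a learning distance, assuming it is the $\ell_2$ norm between the weights after training and at initialization (the paper may restrict it to the surviving weights of the sparse model):

```python
# Sketch of an l2 learning distance: distance between a model's weights after
# training and at initialization. Computed here over all parameters; the
# paper may restrict it to the unpruned weights of the sparse model.
import torch

def l2_learning_distance(init_state, final_state):
    total = 0.0
    for name, w0 in init_state.items():
        total += torch.sum((final_state[name].float() - w0.float()) ** 2).item()
    return total ** 0.5

# Usage: snapshot model.state_dict() at initialization and after training,
# then compare the two snapshots.
```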
arXiv Detail & Related papers (2022-06-17T11:02:15Z)
- Adaptive Online Incremental Learning for Evolving Data Streams [4.3386084277869505]
The first major difficulty is concept drift, that is, the probability distribution of the streaming data changes as the data arrives.
The second major difficulty is catastrophic forgetting, that is, forgetting previously learned knowledge when learning new knowledge.
Our research builds on this observation and attempts to overcome these difficulties.
arXiv Detail & Related papers (2022-01-05T14:25:53Z)
- When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
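For reference, a standard patience-based early-stopping loop, the kind of validation criterion whose reliability the paper studies under label noise (`train_one_epoch` and `val_loss` are hypothetical callables):

```python
# Standard patience-based early stopping on validation loss -- the criterion
# whose behavior under epochwise double descent the paper analyzes.
def train_with_early_stopping(epochs, train_one_epoch, val_loss, patience=10):
    best, best_epoch = float("inf"), 0
    for epoch in range(epochs):
        train_one_epoch()
        loss = val_loss()
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best, best_epoch
```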
arXiv Detail & Related papers (2021-08-26T19:19:17Z)
- Probabilistic Modeling for Human Mesh Recovery [73.11532990173441]
This paper focuses on the problem of 3D human reconstruction from 2D evidence.
We recast the problem as learning a mapping from the input to a distribution of plausible 3D poses.
arXiv Detail & Related papers (2021-08-26T17:55:11Z)
- Spatio-Temporal Graph Contrastive Learning [49.132528449909316]
We propose a Spatio-Temporal Graph Contrastive Learning framework (STGCL) to tackle these issues.
We elaborate on four types of data augmentations that disturb the data in terms of graph structure, time domain, and frequency domain.
Our framework is evaluated across three real-world datasets and four state-of-the-art models.
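Two of the mentioned augmentation families, disturbing the graph structure (edge dropping) and the time domain (time-step masking), might look like the generic sketch below (illustrative only; STGCL's exact four augmentations differ in detail):

```python
# Generic spatio-temporal augmentations: random edge dropping (graph
# structure) and random time-step masking (time domain). Illustrative only.
import numpy as np

def drop_edges(adj, p=0.1, rng=None):
    """Zero out a fraction p of the entries of a dense adjacency matrix."""
    if rng is None:
        rng = np.random.default_rng()
    return adj * (rng.random(adj.shape) >= p)

def mask_time_steps(x, p=0.1, rng=None):
    """x: array of shape (time, nodes, features); zero random time steps."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None, None]
```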
arXiv Detail & Related papers (2021-08-26T16:05:32Z)
- Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting [100.75479161884935]
We propose a novel training paradigm called Remembering for the Right Reasons (RRR).
RRR stores visual model explanations for each example in the buffer and ensures the model has "the right reasons" for its predictions.
We demonstrate how RRR can be easily added to any memory or regularization-based approach and results in reduced forgetting.
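The gist is an auxiliary loss penalizing drift between the current explanation and the one stored in the buffer; a sketch assuming input-gradient saliency as the explanation method (the paper's exact explanation type and loss weighting may differ):

```python
# Sketch of an RRR-style explanation regularizer: penalize the distance
# between the current input-gradient saliency and the saliency stored in the
# replay buffer. The explanation method and loss weight are assumptions.
import torch
import torch.nn.functional as F

def rrr_loss(model, x_buf, y_buf, saliency_buf, lam=1.0):
    # x_buf: buffered inputs; y_buf: LongTensor labels; saliency_buf: stored maps
    x = x_buf.clone().requires_grad_(True)
    logits = model(x)
    task = F.cross_entropy(logits, y_buf)
    # Saliency = gradient of the true-class score w.r.t. the input.
    score = logits.gather(1, y_buf[:, None]).sum()
    sal = torch.autograd.grad(score, x, create_graph=True)[0]
    return task + lam * (sal - saliency_buf).abs().mean()
```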
arXiv Detail & Related papers (2020-10-04T10:05:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.