Sparse Double Descent: Where Network Pruning Aggravates Overfitting
- URL: http://arxiv.org/abs/2206.08684v1
- Date: Fri, 17 Jun 2022 11:02:15 GMT
- Title: Sparse Double Descent: Where Network Pruning Aggravates Overfitting
- Authors: Zheng He, Zeke Xie, Quanzhi Zhu, Zengchang Qin
- Abstract summary: We report an unexpected sparse double descent phenomenon that, as we increase model sparsity via network pruning, test performance first gets worse, then gets better, and finally gets worse again.
We propose a novel learning distance interpretation: the curve of the $\ell_2$ learning distance of sparse models may correlate well with the sparse double descent curve.
- Score: 8.425040193238777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People usually believe that network pruning not only reduces the
computational cost of deep networks, but also prevents overfitting by
decreasing model capacity. However, our work surprisingly discovers that
network pruning sometimes even aggravates overfitting. We report an unexpected
sparse double descent phenomenon that, as we increase model sparsity via
network pruning, test performance first gets worse (due to overfitting), then
gets better (due to relieved overfitting), and gets worse at last (due to
forgetting useful information). While recent studies focused on the deep double
descent with respect to model overparameterization, they failed to recognize
that sparsity may also cause double descent. In this paper, we have three main
contributions. First, we report the novel sparse double descent phenomenon
through extensive experiments. Second, for this phenomenon, we propose a novel
learning distance interpretation that the curve of $\ell_{2}$ learning distance
of sparse models (from initialized parameters to final parameters) may
correlate with the sparse double descent curve well and reflect generalization
better than minima flatness. Third, in the context of sparse double descent, a
winning ticket in the lottery ticket hypothesis surprisingly may not always
win.
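The $\ell_{2}$ learning distance above is simply the Euclidean norm between a model's final and initial parameters. A minimal numpy sketch of this idea, combining one-shot magnitude pruning with the distance computation (the `magnitude_prune` helper and the random "trained" parameters are hypothetical stand-ins for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_init = rng.normal(size=1000)   # parameters at initialization
theta = theta_init.copy()            # stand-in for "trained" parameters

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

for s in (0.5, 0.9, 0.99):
    pruned = magnitude_prune(theta, s)
    # l2 learning distance: ||theta_final - theta_init||_2
    dist = np.linalg.norm(pruned - theta_init)
    frac_zero = np.mean(pruned == 0.0)
    print(f"sparsity {s:.2f}: {frac_zero:.2%} zeros, learning distance {dist:.2f}")
```

In the paper's interpretation, tracking this distance across sparsity levels is what traces out a curve correlating with the test-error curve; the snippet only shows the mechanics of computing it.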
Related papers
- Understanding the Double Descent Phenomenon in Deep Learning [49.1574468325115]
This tutorial sets up the classical statistical learning framework and introduces the double descent phenomenon.
By looking at a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent.
Section 3 explores double descent with two linear models, and gives other points of view from recent related works.
arXiv Detail & Related papers (2024-03-15T16:51:24Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We show when and where double descent appears, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Double Descent Demystified: Identifying, Interpreting & Ablating the
Sources of a Deep Learning Puzzle [12.00962791565144]
Double descent is a surprising phenomenon in machine learning.
As the number of model parameters grows relative to the number of data points, test error first drops, then rises near the interpolation threshold, and then drops again.
arXiv Detail & Related papers (2023-03-24T17:03:40Z) - DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural
Network Worry-Free? [7.793339267280654]
First, we propose a learning framework that avoids sparse double descent and improves generalization.
Second, we introduce an entropy measure providing more insights into the emergence of this phenomenon.
Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise.
arXiv Detail & Related papers (2023-03-02T12:54:12Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
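As a hedged illustration of the setup described above (not the paper's analysis), the sketch below runs gradient descent on a linear least-squares problem with a fixed random pruning mask reapplied at every step; when the pruning fraction is low enough that the surviving weights still outnumber the samples, the training loss is driven toward zero. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 120                             # samples, weights
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

q = 0.5                                    # pruning fraction
mask = (rng.random(d) > q).astype(float)   # 1 = kept weight, 0 = pruned
w = np.zeros(d)
lr = 0.1
losses = []
for _ in range(4000):
    resid = X @ (w * mask) - y
    losses.append(0.5 * np.mean(resid ** 2))
    grad = (X.T @ resid / n) * mask        # gradient restricted to kept weights
    w -= lr * grad

print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.2e}")
```

With roughly 60 surviving weights against 30 samples, the pruned model can still interpolate the labels, so the loss decays geometrically; at much higher pruning fractions this stops being true, matching the threshold behavior described above.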
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, adversarial labels.
Our results strongly suggest that, at least in this model, the convergence phenomenon extends well beyond the NTK regime.
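The model described can be mimicked in a few lines of numpy: a depth-2 ReLU network with quadratic loss, Gaussian inputs, and arbitrary (even adversarial) labels, trained by plain gradient descent. This is only an illustrative sketch of the setting, not the paper's construction; all constants are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 20, 5, 64                        # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                     # arbitrary labels
W = rng.normal(size=(d, m)) / np.sqrt(d)   # first-layer weights
a = rng.normal(size=m) / np.sqrt(m)        # second-layer weights

lr = 0.01
losses = []
for _ in range(3000):
    H = np.maximum(X @ W, 0)               # hidden ReLU activations
    err = H @ a - y
    losses.append(0.5 * np.mean(err ** 2))
    # Backpropagate the quadratic loss through both layers.
    grad_a = H.T @ err / n
    grad_H = np.outer(err, a) * (H > 0)    # gradient through the ReLU
    grad_W = X.T @ grad_H / n
    a -= lr * grad_a
    W -= lr * grad_W

print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The point of the paper is that gradient descent in this setting reaches a global minimum; the sketch only shows the training dynamics, with the loss decreasing monotonically for a small enough step size.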
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves faster-than-$n^{-1/4}$ $L^2$ convergence in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.