Theoretical Characterization of How Neural Network Pruning Affects its Generalization
- URL: http://arxiv.org/abs/2301.00335v2
- Date: Thu, 5 Jan 2023 02:53:08 GMT
- Title: Theoretical Characterization of How Neural Network Pruning Affects its Generalization
- Authors: Hongru Yang, Yingbin Liang, Xiaojie Guo, Lingfei Wu, Zhangyang Wang
- Abstract summary: This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
- Score: 131.1347309639727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been observed in practice that applying pruning-at-initialization
methods to neural networks and training the sparsified networks can not only
retain the testing performance of the original dense models, but also sometimes
even slightly boost the generalization performance. A theoretical understanding
of such experimental observations is yet to be developed. This work makes the
first attempt to study how different pruning fractions affect the model's
gradient descent dynamics and generalization. Specifically, this work considers
a classification task for overparameterized two-layer neural networks, where
the network is randomly pruned according to different rates at the
initialization. It is shown that as long as the pruning fraction is below a
certain threshold, gradient descent can drive the training loss toward zero and
the network exhibits good generalization performance. More surprisingly, the
generalization bound gets better as the pruning fraction gets larger. To
complement this positive result, this work further shows a negative result:
there exists a large pruning fraction such that while gradient descent is still
able to drive the training loss toward zero (by memorizing noise), the
generalization performance is no better than random guessing. This further
suggests that pruning can change the feature learning process, which leads to
the performance drop of the pruned neural network.
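As a concrete illustration of the setting described above, the sketch below randomly prunes the first layer of an overparameterized two-layer ReLU network at initialization and trains only the surviving weights with full-batch gradient descent. The synthetic Gaussian data, logistic loss, fixed +/-1 second layer, step size, and pruning rate are illustrative assumptions, not the paper's exact construction, where these choices are made carefully so that the pruning-fraction threshold and generalization bounds can be proved.

```python
# Minimal sketch: random pruning at initialization of a two-layer ReLU network,
# then full-batch gradient descent on the surviving first-layer weights.
# All concrete choices below (data model, loss, scaling, learning rate) are
# simplifying assumptions for illustration only.
import torch

torch.manual_seed(0)
n, d, m = 200, 20, 1024          # samples, input dim, hidden width (overparameterized)
prune_frac = 0.5                 # fraction of first-layer weights pruned at init

# Synthetic binary classification data (placeholder for the paper's data model).
X = torch.randn(n, d)
y = torch.sign(X[:, 0])          # labels in {-1, +1}

# Two-layer ReLU network: f(x) = sum_r a_r * relu(w_r^T x); only W is trained.
W = torch.randn(m, d) / d ** 0.5
a = torch.sign(torch.randn(m)) / m ** 0.5   # fixed second layer

# Random pruning at initialization: each first-layer weight is zeroed
# independently with probability prune_frac and stays zero during training.
mask = (torch.rand(m, d) > prune_frac).float()
W = (W * mask).requires_grad_()

lr = 0.5
for step in range(500):
    logits = torch.relu(X @ W.t()) @ a                        # network outputs
    loss = torch.nn.functional.softplus(-y * logits).mean()   # logistic loss
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad * mask   # gradient step only on surviving weights
        W.grad.zero_()

print(f"final training loss: {loss.item():.4f}")
```

Sweeping prune_frac in this toy setup mirrors the two regimes analyzed in the paper: for moderate pruning the surviving subnetwork is still overparameterized enough that gradient descent drives the training loss toward zero and generalizes well, while for very aggressive pruning the loss can still be driven down by memorizing noise even though test performance collapses.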
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise, other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- On the optimization and generalization of overparameterized implicit neural networks [25.237054775800164]
Implicit neural networks have become increasingly attractive in the machine learning community.
We show that global convergence is guaranteed, even if only the implicit layer is trained.
This paper investigates the generalization error for implicit neural networks.
arXiv Detail & Related papers (2022-09-30T16:19:46Z)
- The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training [111.15069968583042]
Random pruning is arguably the most naive way to attain sparsity in neural networks, but it has been deemed uncompetitive compared with either post-training pruning or sparse training.
We empirically demonstrate that sparsely training a randomly pruned network from scratch can match the performance of its dense equivalent.
Our results strongly suggest there is larger-than-expected room for sparse training at scale, and the benefits of sparsity might be more universal beyond carefully designed pruning.
arXiv Detail & Related papers (2022-02-05T21:19:41Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by studying the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [79.78184026678659]
We study the effect of pruning throughout training from the perspective of pruning plasticity.
We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST).
Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet.
arXiv Detail & Related papers (2021-06-19T02:09:25Z)
- Pre-interpolation loss behaviour in neural networks [3.8716601453641886]
We show that test loss does not increase overall, but only for a small minority of samples.
This effect seems to be mainly caused by increased parameter values relating to the correctly processed sample features.
Our findings contribute to the practical understanding of a common behaviour of deep neural networks.
arXiv Detail & Related papers (2021-03-14T18:08:59Z)
- Gradient Boosting Neural Networks: GrowNet [9.0491536808974]
A novel gradient boosting framework is proposed where shallow neural networks are employed as "weak learners".
A fully corrective step is incorporated to remedy the pitfall of greedy function approximation of classic gradient boosting decision trees.
The proposed model outperformed state-of-the-art boosting methods in all three tasks on multiple datasets (a minimal sketch of this boosting scheme appears after the list).
arXiv Detail & Related papers (2020-02-19T03:02:52Z)
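The GrowNet entry above boosts shallow networks as weak learners. Below is a minimal sketch of that general recipe under simplifying assumptions (squared loss, synthetic data, scikit-learn's MLPRegressor as the weak learner); it is not GrowNet itself, and the fully corrective step mentioned in its abstract is omitted.

```python
# Minimal sketch of gradient boosting with shallow neural networks as weak
# learners: each round fits a small network to the residuals (the negative
# gradient of the squared loss) and adds it to the ensemble.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

n_rounds, lr = 10, 0.3
F = np.full_like(y, y.mean())        # initial constant prediction
learners = []

for k in range(n_rounds):
    residual = y - F                 # negative gradient of squared loss at the current ensemble
    h = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=k)
    h.fit(X, residual)               # shallow network fitted as the weak learner
    F += lr * h.predict(X)           # stage-wise additive update
    learners.append(h)
    print(f"round {k}: train MSE = {np.mean((y - F) ** 2):.4f}")
```

Prediction on new inputs sums the initial constant and lr times each stored learner's prediction; the fully corrective step the abstract refers to would periodically re-fit the whole ensemble jointly rather than relying only on these greedy stage-wise updates.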