How much pre-training is enough to discover a good subnetwork?
- URL: http://arxiv.org/abs/2108.00259v3
- Date: Tue, 22 Aug 2023 18:13:50 GMT
- Title: How much pre-training is enough to discover a good subnetwork?
- Authors: Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim,
Anastasios Kyrillidis
- Abstract summary: We mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well.
We find a simple theoretical bound on the number of gradient descent pre-training iterations for a two-layer, fully-connected network.
Experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well.
- Score: 10.699603774240853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network pruning is useful for discovering efficient, high-performing
subnetworks within pre-trained, dense network architectures. More often than
not, it involves a three-step process -- pre-training, pruning, and re-training
-- that is computationally expensive, as the dense model must be fully
pre-trained. While previous work has revealed through experiments the
relationship between the amount of pre-training and the performance of the
pruned network, a theoretical characterization of such dependency is still
missing. Aiming to mathematically analyze the amount of dense network
pre-training needed for a pruned network to perform well, we discover a simple
theoretical bound on the number of gradient descent pre-training iterations on
a two-layer, fully-connected network, beyond which pruning via greedy forward
selection [61] yields a subnetwork that achieves good training error.
Interestingly, this threshold is shown to be logarithmically dependent upon the
size of the dataset, meaning that experiments with larger datasets require more
pre-training for subnetworks obtained via pruning to perform well. Lastly, we
empirically validate our theoretical results on a multi-layer perceptron
trained on MNIST.
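To make the pipeline in the abstract concrete, the sketch below pre-trains a dense two-layer network with gradient descent and then prunes it by greedy forward selection, growing a subnetwork one neuron at a time by always adding the hidden unit that most reduces training error. This is a minimal NumPy illustration under assumed details (synthetic regression data, squared loss, ReLU activations, a 1/|S| output scaling, fixed step size and iteration counts); it is not the paper's exact construction or its MNIST experimental setup.

```python
# Minimal sketch (NumPy) of: (1) pre-train a dense two-layer network,
# (2) prune via greedy forward selection, (3) optionally re-train.
# The model f(x) = (1/|S|) * sum_{j in S} a_j * relu(w_j . x) and all
# sizes/hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical stand-in for the paper's setting).
n, d, width, k = 200, 10, 64, 8          # samples, input dim, dense width, subnetwork size
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))       # arbitrary smooth target

# Dense two-layer network: hidden weights W (width x d), output weights a (width,).
W = rng.normal(size=(width, d)) / np.sqrt(d)
a = rng.normal(size=width) / np.sqrt(width)

def forward(W, a, X, scale):
    return (np.maximum(X @ W.T, 0.0) @ a) * scale

def mse(pred):
    return 0.5 * np.mean((pred - y) ** 2)

# 1) Pre-train the dense network with plain gradient descent on the MSE.
lr = 0.05
for _ in range(500):                      # pre-training iterations (the quantity the bound concerns)
    H = np.maximum(X @ W.T, 0.0)          # (n, width) hidden activations
    pred = (H @ a) / width
    err = (pred - y) / n                  # d(loss)/d(pred)
    grad_a = H.T @ err / width
    grad_W = ((err[:, None] * a[None, :] / width) * (H > 0)).T @ X
    a -= lr * grad_a
    W -= lr * grad_W

# 2) Greedy forward selection: repeatedly add the neuron that most lowers training loss.
selected = []
for _ in range(k):
    best_j, best_loss = None, np.inf
    for j in range(width):
        if j in selected:
            continue
        idx = selected + [j]
        loss = mse(forward(W[idx], a[idx], X, 1.0 / len(idx)))
        if loss < best_loss:
            best_j, best_loss = j, loss
    selected.append(best_j)

print("dense training loss :", mse(forward(W, a, X, 1.0 / width)))
print("pruned training loss:", best_loss, "using neurons", selected)
# 3) (Optional) re-train: fine-tune only W[selected], a[selected] with the same loop.
```

The quantity the paper's bound concerns is the number of dense pre-training iterations in step 1: the stated result is that this threshold grows only logarithmically with the number of training samples, so larger datasets need more pre-training before greedy forward selection can find a subnetwork with good training error.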
Related papers
- Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z)
- The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training [111.15069968583042]
Random pruning is arguably the most naive way to attain sparsity in neural networks, but it has been deemed uncompetitive compared with post-training pruning and sparse-training methods.
We empirically demonstrate that sparsely training a randomly pruned network from scratch can match the performance of its dense equivalent (a brief sketch of this setup follows the related-papers list).
Our results strongly suggest there is larger-than-expected room for sparse training at scale, and that the benefits of sparsity may extend beyond carefully designed pruning.
arXiv Detail & Related papers (2022-02-05T21:19:41Z)
- How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis of the iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments ranging from shallow to deep neural networks are also provided to support our theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network [0.0]
In recent years, deep neural networks have seen wide success in various application domains.
However, they usually involve a large number of parameters, corresponding to the weights of the network.
Pruning methods attempt to reduce the size of this parameter set by identifying and removing irrelevant weights.
arXiv Detail & Related papers (2021-12-15T16:02:15Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by studying the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z)
- Towards Practical Lottery Ticket Hypothesis for Adversarial Training [78.30684998080346]
We show that there exists a subset of such lottery-ticket sub-networks that converges significantly faster during training.
As a practical application of our findings, we demonstrate that such sub-networks can help in cutting down the total time of adversarial training.
arXiv Detail & Related papers (2020-03-06T03:11:52Z)
- NeuroFabric: Identifying Ideal Topologies for Training A Priori Sparse Networks [2.398608007786179]
Long training times of deep neural networks are a bottleneck in machine learning research.
We provide a theoretical foundation for the choice of intra-layer topology.
We show that seemingly similar topologies can often have a large difference in attainable accuracy.
arXiv Detail & Related papers (2020-02-19T18:29:18Z)
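The "Unreasonable Effectiveness of Random Pruning" entry above contrasts with greedy forward selection: instead of selecting neurons from a pre-trained dense model, a random mask is drawn once at initialization and the resulting sparse network is trained from scratch. The sketch below is a minimal, assumed rendering of that setup (fixed Bernoulli mask, masked gradient updates, illustrative layer sizes and sparsity level); it does not reproduce the cited paper's architectures or training schedule.

```python
# Minimal sketch (NumPy) of sparse training from scratch with a random mask:
# prune at initialization, then never update the removed weights.
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 200, 10, 64                           # samples, input dim, hidden width (illustrative)
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))              # arbitrary smooth target

sparsity = 0.8                                  # fraction of hidden weights removed (assumed)
W = rng.normal(size=(h, d)) / np.sqrt(d)
a = rng.normal(size=h) / np.sqrt(h)
mask = rng.random(size=W.shape) > sparsity      # fixed random mask, drawn once
W *= mask                                       # prune at initialization

lr = 0.05
for _ in range(1000):
    H = np.maximum(X @ W.T, 0.0)                # ReLU hidden activations
    pred = (H @ a) / h
    err = (pred - y) / n
    grad_a = H.T @ err / h
    grad_W = ((err[:, None] * a[None, :] / h) * (H > 0)).T @ X
    a -= lr * grad_a
    W -= lr * (grad_W * mask)                   # update only surviving weights

print("sparse-from-scratch training loss:", 0.5 * np.mean((pred - y) ** 2))
```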
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.