Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks
- URL: http://arxiv.org/abs/2110.05667v1
- Date: Tue, 12 Oct 2021 01:11:07 GMT
- Title: Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks
- Authors: Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong
- Abstract summary: We characterize the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
- Score: 79.74580058178594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The \textit{lottery ticket hypothesis} (LTH) states that learning on a
properly pruned network (the \textit{winning ticket}) improves test accuracy
over the original unpruned network. Although LTH has been justified empirically
in a broad range of applications involving deep neural networks (DNNs), such as
computer vision and natural language processing, a theoretical validation of
the improved generalization of a winning ticket remains elusive. To the best of
our knowledge, our work, for the first time, characterizes the performance of
training a pruned neural network by analyzing the geometric structure of the
objective function and the sample complexity to achieve zero generalization
error. We show that the convex region near a desirable model with guaranteed
generalization enlarges as the neural network model is pruned, indicating the
structural importance of a winning ticket. Moreover, when the algorithm for
training a pruned neural network is specified as an (accelerated) stochastic
gradient descent algorithm, we theoretically show that the number of samples
required to achieve zero generalization error is proportional to the number of
non-pruned weights in the hidden layer. With a fixed number of samples,
training a pruned neural network enjoys a faster convergence rate to the
desired model than training the original unpruned one, providing a formal
justification of the improved generalization of the winning ticket. Our
theoretical results are obtained for learning a pruned neural network with one
hidden layer, and experimental results are further provided to justify the
implications for pruning multi-layer neural networks.
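As a concrete illustration of the training setup analyzed above, the following is a minimal PyTorch sketch of magnitude-pruning the hidden layer of a one-hidden-layer ReLU network and training it with momentum SGD while keeping pruned weights at zero. The network sizes, pruning fraction, and synthetic teacher data are arbitrary choices for this example and are not taken from the paper.

```python
# Minimal sketch (not the paper's code): train a one-hidden-layer ReLU network
# whose hidden-layer weights are magnitude-pruned, keeping pruned entries at zero.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k, n = 20, 50, 500          # input dim, hidden width, number of samples (arbitrary)
prune_frac = 0.8               # fraction of hidden-layer weights to prune (arbitrary)

# Synthetic regression data from a random "teacher" network (illustrative only).
X = torch.randn(n, d)
teacher = nn.Sequential(nn.Linear(d, k), nn.ReLU(), nn.Linear(k, 1))
with torch.no_grad():
    y = teacher(X)

student = nn.Sequential(nn.Linear(d, k), nn.ReLU(), nn.Linear(k, 1))
W = student[0].weight          # hidden-layer weight matrix (k x d)

# Magnitude-based mask: zero out roughly the prune_frac fraction of
# hidden-layer weights with the smallest magnitude.
threshold = W.abs().flatten().kthvalue(int(prune_frac * W.numel())).values
mask = (W.abs() > threshold).float()
with torch.no_grad():
    W.mul_(mask)

opt = torch.optim.SGD(student.parameters(), lr=0.05, momentum=0.9)  # momentum SGD
loss_fn = nn.MSELoss()
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(student(X), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        W.mul_(mask)           # re-apply the mask so pruned weights stay zero
print(f"final training loss: {loss.item():.4f}")
```

Varying the pruning fraction and the number of samples in this toy setup loosely mirrors the sample-complexity and convergence claims above, although such an experiment is only an illustration, not evidence.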
Related papers
- Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural Networks [89.28881869440433]
This paper provides the first theoretical characterization of joint edge-model sparse learning for graph neural networks (GNNs).
It proves analytically that both sampling important nodes and pruning the neurons with the lowest-magnitude weights can reduce the sample complexity and improve convergence without compromising test accuracy.
arXiv Detail & Related papers (2023-02-06T16:54:20Z)
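As a rough, self-contained illustration of the two operations named in the entry above, the sketch below samples graph nodes by degree and prunes the lowest-magnitude neurons from a weight matrix. The importance score (node degree), the shapes, and the keep ratios are assumptions for this example, not the paper's method.

```python
# Illustrative sketch only: degree-based node sampling plus lowest-magnitude
# neuron pruning, standing in for the joint edge-model sparsification idea.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, hidden = 100, 64
A = (rng.random((num_nodes, num_nodes)) < 0.05).astype(float)
A = np.maximum(A, A.T)                      # symmetric adjacency (toy graph)
W = rng.normal(size=(hidden, 16))           # a hidden layer's weight matrix

# 1) Keep the most "important" nodes; here importance is simply node degree.
keep_nodes = np.argsort(A.sum(axis=1))[-50:]
A_sub = A[np.ix_(keep_nodes, keep_nodes)]

# 2) Prune the neurons (rows of W) with the smallest weight magnitude.
neuron_norms = np.linalg.norm(W, axis=1)
keep_neurons = np.argsort(neuron_norms)[-32:]
W_pruned = W[keep_neurons]

print(A_sub.shape, W_pruned.shape)          # (50, 50) (32, 16)
```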
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization [3.6321778403619285]
Generalization of deep neural networks remains one of the main open problems in machine learning.
Early layers generally learn representations relevant to performance on both training and testing data.
Deeper layers only minimize the training risk and fail to generalize well on test or mislabeled data.
arXiv Detail & Related papers (2022-01-28T05:26:32Z)
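A common way to probe the claim in the entry above, that early layers learn transferable representations while deeper layers mainly fit the training data, is to train a linear probe on each layer's activations and compare held-out accuracy. The sketch below is a generic probing recipe on made-up data with an assumed three-layer network; it is not the paper's experimental protocol.

```python
# Generic layer-probing sketch (assumed architecture and data, not the paper's setup):
# train a linear probe on each layer's features and compare held-out accuracy.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 32), torch.randint(0, 2, (512,))
X_test, y_test = torch.randn(256, 32), torch.randint(0, 2, (256,))

layers = nn.ModuleList([nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
                        nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
                        nn.Sequential(nn.Linear(64, 64), nn.ReLU())])

def features_per_layer(x):
    """Return the activations after each layer (the network is untrained here)."""
    feats = []
    with torch.no_grad():
        for layer in layers:
            x = layer(x)
            feats.append(x.numpy())
    return feats

for i, (f_tr, f_te) in enumerate(zip(features_per_layer(X_train),
                                     features_per_layer(X_test))):
    probe = LogisticRegression(max_iter=1000).fit(f_tr, y_train.numpy())
    print(f"layer {i}: probe test accuracy = {probe.score(f_te, y_test.numpy()):.3f}")
```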
- A Kernel-Expanded Stochastic Neural Network [10.837308632004644]
Deep neural networks often get trapped in local minima during training.
The kernel-expanded stochastic neural network (K-StoNet) model reformulates the network as a latent variable model.
The model can be easily trained using the imputation-regularized optimization (IRO) algorithm.
arXiv Detail & Related papers (2022-01-14T06:42:42Z)
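As a loose illustration of only the kernel-expansion ingredient in the entry above, the sketch below maps inputs through scikit-learn's RBFSampler (a random-feature kernel approximation) before fitting a simple model. It does not implement K-StoNet's latent-variable reformulation or the IRO algorithm.

```python
# Loose illustration of kernel feature expansion before a simple model;
# this is NOT K-StoNet or the IRO algorithm, just the kernel-expansion idea.
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)   # toy nonlinear target

# Map inputs into a randomized RBF feature space (approximate kernel expansion).
expander = RBFSampler(gamma=0.5, n_components=200, random_state=0)
Z = expander.fit_transform(X)

# Fit a simple model on the expanded features; K-StoNet instead treats the
# expanded layer within a latent variable model trained by IRO.
model = Ridge(alpha=1.0).fit(Z, y)
print("training R^2 on expanded features:", round(model.score(Z, y), 3))
```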
- Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics [85.31710759801705]
Current practice incurs expensive computational costs because performance prediction requires model training.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z)
- Mitigating Performance Saturation in Neural Marked Point Processes: Architectures and Loss Functions [50.674773358075015]
We propose a simple graph-based network structure called GCHP, which utilizes only graph convolutional layers.
We show that GCHP can significantly reduce training time, and that the likelihood-ratio loss with interarrival-time probability assumptions can greatly improve model performance.
arXiv Detail & Related papers (2021-07-07T16:59:14Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
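A minimal experiment in the spirit of the entry above: make a wide two-layer ReLU network interpolate noisy training labels (here with Adam for convenience) and then measure its error on clean test data. The widths, noise level, and data distribution are arbitrary choices for illustration, not the paper's setting.

```python
# Toy benign-overfitting check (illustrative setting, not the paper's):
# interpolate noisy training labels with a wide two-layer ReLU net,
# then measure error on clean test data.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, width, n = 10, 2000, 200
w_star = torch.randn(d, 1) / d**0.5                   # ground-truth linear signal

X_train = torch.randn(n, d)
y_train = X_train @ w_star + 0.5 * torch.randn(n, 1)  # noisy labels
X_test = torch.randn(1000, d)
y_test = X_test @ w_star                              # clean targets

net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(5000):
    opt.zero_grad()
    loss = loss_fn(net(X_train), y_train)
    loss.backward()
    opt.step()

with torch.no_grad():
    print(f"train MSE: {loss_fn(net(X_train), y_train).item():.4f}  "
          f"test MSE: {loss_fn(net(X_test), y_test).item():.4f}")
```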
- Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set [0.0]
We suggest studying the training of neural networks with Algebraic Topology, specifically Persistent Homology.
Using simplicial complex representations of neural networks, we study the evolution of the PH diagram distance during the neural network learning process.
Results show that the PH diagram distance between consecutive neural network states correlates with the validation accuracy.
arXiv Detail & Related papers (2021-05-31T09:17:31Z)
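The sketch below gestures at the measurement described above: compute persistence diagrams from a network's weights at consecutive training checkpoints and track the distance between them. It assumes the ripser and persim packages and treats flattened weights as a raw point cloud, which is only a crude stand-in for the paper's simplicial-complex representation of the network.

```python
# Rough sketch (assumes the `ripser` and `persim` packages are installed):
# persistence-diagram distance between consecutive snapshots of a network's weights.
# The paper builds simplicial complexes from the network itself; here we simply
# treat weight vectors as a point cloud, which is only a crude stand-in.
import numpy as np
from ripser import ripser
from persim import bottleneck

def weight_cloud(weights, rng):
    """Turn a flat weight vector into a small 2-D point cloud (arbitrary choice)."""
    w = weights[: 200 * 2].reshape(-1, 2)
    return w + 1e-6 * rng.normal(size=w.shape)   # jitter to avoid degenerate points

rng = np.random.default_rng(0)
snapshot_a = rng.normal(size=1000)                       # weights at checkpoint t
snapshot_b = snapshot_a + 0.05 * rng.normal(size=1000)   # weights at checkpoint t+1

dgm_a = ripser(weight_cloud(snapshot_a, rng))["dgms"][1]  # H1 persistence diagram
dgm_b = ripser(weight_cloud(snapshot_b, rng))["dgms"][1]
print("bottleneck distance between consecutive checkpoints:",
      round(bottleneck(dgm_a, dgm_b), 4))
```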
- Compressive Sensing and Neural Networks from a Statistical Learning Perspective [4.561032960211816]
We present a generalization error analysis for a class of neural networks suitable for sparse reconstruction from few linear measurements.
Under realistic conditions, the generalization error scales only logarithmically in the number of layers, and at most linearly in the number of measurements.
arXiv Detail & Related papers (2020-10-29T15:05:43Z)
- A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function is derived and shown to be equivalent to a fully trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results.
arXiv Detail & Related papers (2020-07-02T05:07:55Z)
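To make the kernel object in the entry above concrete, the sketch below computes an empirical (finite-width) neural tangent kernel for a tiny network by taking per-example parameter gradients and forming their Gram matrix. This illustrates the standard empirical NTK, not the specific analytic kernel derived in the paper.

```python
# Empirical (finite-width) NTK for a tiny network: K[i, j] is the inner product
# of the parameter gradients of the outputs for examples i and j.
# This illustrates the NTK object itself, not the paper's analytic kernel.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))
X = torch.randn(8, 5)                       # a small batch of inputs

def flat_grad(x):
    """Gradient of the scalar output at x w.r.t. all parameters, flattened."""
    net.zero_grad()
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, net.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

G = torch.stack([flat_grad(x) for x in X])  # one gradient row per example
K = G @ G.T                                 # empirical NTK Gram matrix (8 x 8)
print(K.shape, K.diag()[:3])
```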
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.