On the Predictability of Pruning Across Scales
- URL: http://arxiv.org/abs/2006.10621v3
- Date: Sun, 4 Jul 2021 02:51:24 GMT
- Title: On the Predictability of Pruning Across Scales
- Authors: Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit
- Abstract summary: We show that the error of magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task.
As neural networks become ever larger and costlier to train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.
- Score: 29.94870276983399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that the error of iteratively magnitude-pruned networks empirically
follows a scaling law with interpretable coefficients that depend on the
architecture and task. We functionally approximate the error of the pruned
networks, showing it is predictable in terms of an invariant tying width,
depth, and pruning level, such that networks of vastly different pruned
densities are interchangeable. We demonstrate the accuracy of this
approximation over orders of magnitude in depth, width, dataset size, and
density. We show that the functional form holds (generalizes) for large scale
data (e.g., ImageNet) and architectures (e.g., ResNets). As neural networks
become ever larger and costlier to train, our findings suggest a framework for
reasoning conceptually and analytically about a standard method for
unstructured pruning.
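The scaling law above is fit to errors measured after iterative magnitude pruning (IMP) at many densities. As a minimal sketch of that procedure only (not the authors' code), the snippet below repeatedly masks the smallest-magnitude entries of a toy weight matrix and records the density after each round; the model, the data, and the retrain/rewind step that IMP performs between rounds are placeholders, and the helper names are illustrative.

```python
# Minimal sketch of iterative magnitude pruning (IMP) on a toy weight matrix.
# Illustration of the standard unstructured-pruning method the paper analyzes,
# not the authors' implementation; retraining between rounds is omitted.
import numpy as np

rng = np.random.default_rng(0)

def magnitude_mask(weights, density):
    """Keep the `density` fraction of entries with the largest magnitude."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(density * flat.size)))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

def iterative_magnitude_prune(weights, prune_fraction=0.2, rounds=5):
    """Remove `prune_fraction` of the surviving weights per round, by magnitude."""
    mask = np.ones_like(weights)
    densities = []
    for _ in range(rounds):
        target_density = mask.mean() * (1.0 - prune_fraction)
        mask = magnitude_mask(weights * mask, target_density)
        densities.append(float(mask.mean()))
        # A real IMP pipeline retrains (or rewinds and retrains) the surviving
        # weights here before the next pruning round.
    return mask, densities

weights = rng.normal(size=(256, 256))  # stand-in for one layer's weights
mask, densities = iterative_magnitude_prune(weights)
print("density after each round:", np.round(densities, 3))
```

In the paper's setting, it is the test error of the network obtained at each such density that the functional form predicts jointly from width, depth, and density.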
Related papers
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Slope and generalization properties of neural networks [0.0]
We show that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network.
The slope is of similar size throughout the relevant volume and varies smoothly; it also behaves as predicted in rescaling examples.
We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
arXiv Detail & Related papers (2021-07-03T17:54:27Z) - Analytic Insights into Structure and Rank of Neural Network Hessian Maps [32.90143789616052]
The Hessian of a neural network captures parameter interactions through second-order derivatives of the loss.
We develop theoretical tools to analyze the range of the Hessian map, providing us with a precise understanding of its rank deficiency.
This yields exact formulas and tight upper bounds for the Hessian rank of deep linear networks.
arXiv Detail & Related papers (2021-06-30T17:29:58Z) - Generic Perceptual Loss for Modeling Structured Output Dependencies [78.59700528239141]
We show that what matters is the network structure rather than the trained weights.
We demonstrate that a randomly-weighted deep CNN can be used to model the structured dependencies of outputs.
arXiv Detail & Related papers (2021-03-18T23:56:07Z) - Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy [42.15969584135412]
Neural network pruning is a popular technique used to reduce the inference costs of modern networks.
We evaluate whether the use of test accuracy alone in the terminating condition is sufficient to ensure that the resulting model performs well.
We find that pruned networks effectively approximate the unpruned model; however, the prune ratio at which pruned networks achieve commensurate performance varies significantly across tasks.
arXiv Detail & Related papers (2021-03-04T13:22:16Z) - Explaining Neural Scaling Laws [17.115592382420626]
Population loss of trained deep neural networks often follows precise power-law scaling relations (a generic power-law form is sketched after this list).
We propose a theory that explains the origins of and connects these scaling laws.
We identify variance-limited and resolution-limited scaling behavior for both dataset and model size.
arXiv Detail & Related papers (2021-02-12T18:57:46Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with widths quadratic in the sample size and linear in their depth, at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Mixed-Privacy Forgetting in Deep Networks [114.3840147070712]
We show that the influence of a subset of the training samples can be removed from the weights of a network trained on large-scale image classification tasks.
Inspired by real-world applications of forgetting techniques, we introduce a novel notion of forgetting in a mixed-privacy setting.
We show that our method allows forgetting without trading off model accuracy.
arXiv Detail & Related papers (2020-12-24T19:34:56Z) - Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent and easily measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
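For readers unfamiliar with the scaling-law language used in the abstract above and in the "Explaining Neural Scaling Laws" entry, a generic saturating power law in a single scale variable is sketched below. This is an illustrative form under the assumption of one scale variable only; it is not the joint width-depth-density law derived in the paper, nor the variance-limited or resolution-limited forms of the related work.

```latex
% Generic saturating power law in a scale variable s (e.g., width, depth,
% dataset size, or density). Here a, b > 0 and \epsilon_\infty are fitted
% coefficients, with \epsilon_\infty the irreducible error floor.
\begin{equation}
  \epsilon(s) \approx a \, s^{-b} + \epsilon_{\infty}
\end{equation}
% On a log-log plot, \epsilon(s) - \epsilon_{\infty} versus s is a straight
% line of slope -b, which is how such fits are typically checked.
```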
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.