A Three-regime Model of Network Pruning
- URL: http://arxiv.org/abs/2305.18383v1
- Date: Sun, 28 May 2023 08:09:25 GMT
- Title: A Three-regime Model of Network Pruning
- Authors: Yefan Zhou, Yaoqing Yang, Arin Chang, Michael W. Mahoney
- Abstract summary: We use temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance.
A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance.
Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model.
- Score: 47.92525418773768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has highlighted the complex influence training hyperparameters,
e.g., the number of training epochs, can have on the prunability of machine
learning models. Perhaps surprisingly, a systematic approach to predict
precisely how adjusting a specific hyperparameter will affect prunability
remains elusive. To address this gap, we introduce a phenomenological model
grounded in the statistical mechanics of learning. Our approach uses
temperature-like and load-like parameters to model the impact of neural network
(NN) training hyperparameters on pruning performance. A key empirical result we
identify is a sharp transition phenomenon: depending on the value of a
load-like parameter in the pruned model, increasing the value of a
temperature-like parameter in the pre-pruned model may either enhance or impair
subsequent pruning performance. Based on this transition, we build a
three-regime model by taxonomizing the global structure of the pruned NN loss
landscape. Our model reveals that the dichotomous effect of high temperature is
associated with transitions between distinct types of global structures in the
post-pruned model. Based on our results, we present three case studies: 1)
determining whether to increase or decrease a hyperparameter for improved
pruning; 2) selecting the best model to prune from a family of models; and 3)
tuning the hyperparameter of the Sharpness Aware Minimization method for better
pruning performance.
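To make the temperature/load framing concrete, the sketch below (not the authors' code; `make_model`, `train`, and `evaluate` are hypothetical placeholders) shows one way such a sweep could be set up with PyTorch's magnitude-pruning utilities, treating batch size as a temperature-like knob and the fraction of weights kept after pruning as a load-like knob:

```python
# Minimal sketch, assuming batch size as the temperature-like parameter and
# the kept-weight fraction as the load-like parameter. Model, data, and
# training loop are placeholders to be supplied by the user.
import copy
import torch
import torch.nn.utils.prune as prune

def prune_global_l1(model: torch.nn.Module, keep_fraction: float) -> torch.nn.Module:
    """Globally magnitude-prune Linear/Conv2d weights, keeping `keep_fraction` of them."""
    pruned = copy.deepcopy(model)
    params = [
        (m, "weight")
        for m in pruned.modules()
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))
    ]
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=1.0 - keep_fraction
    )
    for module, name in params:
        prune.remove(module, name)  # make the sparsity permanent
    return pruned

def pruning_sweep(make_model, train, evaluate,
                  batch_sizes=(32, 128, 512),         # temperature-like axis
                  keep_fractions=(0.2, 0.05, 0.01)):  # load-like axis
    """Hypothetical sweep: train dense models at several batch sizes, then
    prune each at several densities and record post-pruning accuracy."""
    results = {}
    for bs in batch_sizes:
        dense = make_model()
        train(dense, batch_size=bs)                # smaller batch ~ noisier / higher temperature
        for keep in keep_fractions:
            sparse = prune_global_l1(dense, keep)  # smaller keep fraction ~ higher load
            results[(bs, keep)] = evaluate(sparse)
    return results
```

Tabulating the results over the (batch size, keep fraction) grid is the kind of two-dimensional sweep in which a transition between pruning regimes, as described in the abstract, could be observed.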
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Epidemic Modeling using Hybrid of Time-varying SIRD, Particle Swarm
Optimization, and Deep Learning [6.363653898208231]
Epidemiological models are best suited to modeling an epidemic if the spread pattern is stationary.
We develop a hybrid model encompassing epidemic modeling, particle swarm optimization, and deep learning.
We evaluate the model on three highly affected countries, namely the USA, India, and the UK.
arXiv Detail & Related papers (2024-01-31T18:08:06Z) - Enhancing Dynamical System Modeling through Interpretable Machine
Learning Augmentations: A Case Study in Cathodic Electrophoretic Deposition [0.8796261172196743]
We introduce a comprehensive data-driven framework aimed at enhancing the modeling of physical systems.
As a demonstrative application, we pursue the modeling of cathodic electrophoretic deposition (EPD), commonly known as e-coating.
arXiv Detail & Related papers (2024-01-16T14:58:21Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z) - Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a
Neural Point Process [0.0]
We investigate whether flexible point process models can be applied to short-term seismicity forecasting.
We show how a temporal neural model can forecast earthquakes above a target magnitude threshold.
arXiv Detail & Related papers (2023-01-24T12:15:12Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Provable Benefits of Overparameterization in Model Compression: From
Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists - perhaps counterintuitively - building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z) - Deep Neural Network in Cusp Catastrophe Model [0.0]
Catastrophe theory was originally proposed to study dynamical systems that exhibit sudden shifts in behavior arising from small changes in input.
Here we show how a deep neural network can be trained to learn the dynamics of cusp catastrophe models, without explicitly solving for the generating parameters.
arXiv Detail & Related papers (2020-04-06T00:25:41Z)