Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
- URL: http://arxiv.org/abs/2104.08682v1
- Date: Sun, 18 Apr 2021 02:20:37 GMT
- Title: Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
- Authors: Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao
- Abstract summary: We show for the first time that sparse pruning compresses a BERT model significantly more than reducing its number of channels and layers.
Our method outperforms the leading competitors with a 20-times weight/FLOPs compression and negligible loss in prediction accuracy.
- Score: 5.621336109915588
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based pre-trained language models have significantly improved the
performance of various natural language processing (NLP) tasks in the recent
years. While effective and prevalent, these models are usually prohibitively
large for resource-limited deployment scenarios. A thread of research has thus
been working on applying network pruning techniques under the
pretrain-then-finetune paradigm widely adopted in NLP. However, the existing
pruning results on benchmark transformers, such as BERT, are not as remarkable
as the pruning results in the literature of convolutional neural networks
(CNNs). In particular, common wisdom in CNN pruning holds that sparse pruning
compresses a model more than reducing its number of channels and layers (Elsen
et al., 2020; Zhu and Gupta, 2017), whereas existing work on sparse pruning of
BERT yields results inferior to those of small-dense counterparts such as
TinyBERT (Jiao et al., 2020). In this work, we aim to fill this gap by studying
how knowledge is transferred and lost during the pre-train, fine-tune, and
pruning process, and by proposing a knowledge-aware sparse pruning process that
achieves significantly better results than the
existing literature. We show for the first time that sparse pruning compresses
a BERT model significantly more than reducing its number of channels and
layers. Experiments on multiple data sets of GLUE benchmark show that our
method outperforms the leading competitors with a 20-times weight/FLOPs
compression and negligible loss in prediction accuracy.
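The abstract names the ingredients but not the exact recipe, so the following is only a minimal sketch, assuming a standard combination: a gradual magnitude-pruning schedule in the style of Zhu and Gupta (2017) applied to a fine-tuned model, with knowledge distillation from the dense teacher so that task knowledge is retained while sparsity grows. The function names, schedule constants, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: gradual magnitude pruning with distillation from the dense
# fine-tuned teacher. Names, schedule, and weighting are assumptions.
import copy
import torch
import torch.nn.functional as F

def sparsity_at(step, total_steps, final_sparsity=0.95):
    # Cubic ramp toward final_sparsity (Zhu & Gupta, 2017 style schedule).
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def magnitude_masks(model, sparsity):
    # Keep the largest-magnitude entries of each weight matrix; zero the rest.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                      # skip biases / LayerNorm parameters
            continue
        k = int(p.numel() * sparsity)
        if k == 0:
            masks[name] = torch.ones_like(p)
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
    return masks

def prune_with_distillation(student, loader, total_steps, temperature=2.0, alpha=0.5):
    # `student` stands in for a fine-tuned BERT classifier; `loader` yields (inputs, labels).
    teacher = copy.deepcopy(student).eval()  # dense teacher keeps the task knowledge
    opt = torch.optim.AdamW(student.parameters(), lr=2e-5)
    for step, (x, y) in enumerate(loader):
        masks = magnitude_masks(student, sparsity_at(step, total_steps))
        with torch.no_grad():                # impose the current sparsity pattern
            for name, p in student.named_parameters():
                if name in masks:
                    p.mul_(masks[name])
        s_logits = student(x)
        with torch.no_grad():
            t_logits = teacher(x)
        kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                      F.softmax(t_logits / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = alpha * F.cross_entropy(s_logits, y) + (1.0 - alpha) * kd
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A full implementation would freeze the final masks and continue fine-tuning the surviving weights; here the mask is simply recomputed and reapplied at every step.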
Related papers
- YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training [9.02251811867533]
YOSO (You-Only-Sample-Once) is an algorithm designed to achieve efficient training while preserving prediction accuracy.
YOSO not only avoids costly computations in traditional compressed sensing (CS) methods, such as orthonormal basis calculations, but also ensures high-probability accuracy retention.
arXiv Detail & Related papers (2024-11-08T16:47:51Z)
- Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition [11.399520888150468]
We present a theoretically justified technique termed Low-Rank Induced Training (LoRITa).
LoRITa promotes low-rankness through the composition of linear layers and compresses via singular value truncation (a sketch of this idea appears after this list).
We demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks.
arXiv Detail & Related papers (2024-05-06T00:58:23Z)
- ThinResNet: A New Baseline for Structured Convolutional Networks Pruning [1.90298817989995]
Pruning is a compression method which aims to improve the efficiency of neural networks by reducing their number of parameters.
In this work, we verify how results in the recent literature of pruning hold up against networks that underwent both state-of-the-art training methods and trivial model scaling.
arXiv Detail & Related papers (2023-09-22T13:28:18Z)
- Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models [106.19385911520652]
The Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune large pre-trained models, generating sparse subnetworks.
LTH is, however, severely limited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP).
We propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks in a single pass.
arXiv Detail & Related papers (2023-06-18T03:09:52Z)
- Pruning Deep Neural Networks from a Sparsity Perspective [34.22967841734504]
Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance.
We propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm.
arXiv Detail & Related papers (2023-02-11T04:52:20Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove the least important channels.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on-and-off of each channel (a generic gate sketch appears after this list).
Experiments on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [79.78184026678659]
We study the effect of pruning throughout training from the perspective of pruning plasticity.
We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST); a prune-and-regrow sketch appears after this list.
Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet.
arXiv Detail & Related papers (2021-06-19T02:09:25Z)
- S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration [74.5509794733707]
We present a novel guided learning paradigm that distills binary networks from real-valued networks through the final prediction distribution.
Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.515% on BNNs.
Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods.
arXiv Detail & Related papers (2021-02-17T18:59:28Z)
- Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains (a minimal sketch appears after this list).
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
- Robust Pruning at Initialization [61.30574156442608]
There is a growing need for smaller, energy-efficient neural networks that bring machine learning applications to devices with limited computational resources.
For deep NNs, existing pruning-at-initialization procedures remain unsatisfactory: the resulting pruned networks can be difficult to train and, for instance, nothing prevents one layer from being fully pruned.
arXiv Detail & Related papers (2020-02-19T17:09:50Z)
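For the LoRITa entry above: a minimal sketch of the stated mechanism, low-rankness promoted by composing linear layers during training and compression by singular value truncation afterwards. The two-factor composition, the `compress` helper, and the chosen rank are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of low-rank induced training via linear-layer composition
# followed by singular value truncation; shapes and names are assumptions.
import torch
import torch.nn as nn

class ComposedLinear(nn.Module):
    # One d_in -> d_out layer overparameterized as two linear maps with no
    # nonlinearity in between; training such compositions (with weight decay)
    # tends to drive the effective product matrix toward low rank.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = nn.Linear(d_in, d_in, bias=False)
        self.b = nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.b(self.a(x))

    def compress(self, rank):
        # Collapse the composition, truncate its SVD, and keep two thin factors.
        with torch.no_grad():
            w = self.b.weight @ self.a.weight          # effective d_out x d_in matrix
            u, s, vh = torch.linalg.svd(w, full_matrices=False)
            first = nn.Linear(w.shape[1], rank, bias=False)   # d_in -> rank
            second = nn.Linear(rank, w.shape[0], bias=True)   # rank -> d_out
            first.weight.copy_(torch.diag(s[:rank]) @ vh[:rank, :])
            second.weight.copy_(u[:, :rank])
            second.bias.copy_(self.b.bias)
        return nn.Sequential(first, second)
```

After training, `compress(rank)` replaces the composition by a rank-`rank` factorization, which is where the parameter savings come from.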
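For the GDP entry above: a generic sketch of gate-based channel pruning, with a learnable per-channel gate in front of a convolution and a penalty that pushes gate values toward 0 or 1. The paper's exact differentiable-polarization function is not reproduced; `ChannelGate` and its penalty are stand-ins.

```python
# Hedged sketch of gate-based channel pruning; the polarization penalty here
# is a generic stand-in, not the GDP formulation.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Parameter(torch.ones(channels))    # one gate per channel

    def forward(self, x):                              # x: (N, C, H, W)
        return x * self.g.view(1, -1, 1, 1)

    def polarization_penalty(self):
        # Minimized when each gate sits at 0 (prune the channel) or 1 (keep it).
        g = self.g.abs()
        return (g * (1.0 - g)).abs().sum()

gate = ChannelGate(64)
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
x = torch.randn(2, 64, 32, 32)
out = conv(gate(x))                                    # gates scale channels on/off
loss = out.pow(2).mean() + 1e-3 * gate.polarization_penalty()   # placeholder task loss
loss.backward()
```

Channels whose gates end up near zero can then be removed together with the corresponding convolution filters.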
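For the GraNet entry above: a hedged sketch of a single prune-and-regrow step, dropping the smallest-magnitude active weights and reactivating the same number of inactive weights with the largest gradient magnitude. The paper's exact zero-cost regeneration criterion and schedule are not reproduced; `prune_and_regrow` is illustrative only.

```python
# Hedged sketch of one prune-and-regrow step; the regeneration rule is a
# generic gradient-magnitude criterion, not GraNet's exact procedure.
import torch

def prune_and_regrow(weight, grad, mask, prune_frac=0.1):
    # weight, grad, mask: tensors of identical shape; mask holds 0.0 / 1.0.
    active = mask.bool()
    n = int(active.sum().item() * prune_frac)
    if n == 0:
        return mask
    new_mask = mask.clone().flatten()
    # 1) Prune: drop the n smallest-magnitude active weights.
    mags = weight.abs().masked_fill(~active, float("inf")).flatten()
    new_mask[torch.topk(mags, n, largest=False).indices] = 0.0
    # 2) Regrow: reactivate the n inactive positions with the largest gradient.
    grads = grad.abs().flatten().masked_fill(new_mask.bool(), float("-inf"))
    new_mask[torch.topk(grads, n, largest=True).indices] = 1.0
    return new_mask.view_as(mask)
```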
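For the growing-regularization entry above: a minimal sketch of an L2 penalty whose coefficient rises over training for the weights slated for removal, pushing them smoothly toward zero before they are actually pruned. The selection of `params_to_prune`, the linear ramp, and `max_coeff` are assumptions.

```python
# Hedged sketch of a rising L2 penalty on weights slated for pruning;
# the selection rule and ramp are assumptions, not the paper's schedule.
import torch

def growing_l2_penalty(params_to_prune, step, ramp_steps, max_coeff=1e-2):
    # Penalty factor grows linearly from 0 to max_coeff over ramp_steps.
    coeff = max_coeff * min(step / ramp_steps, 1.0)
    return coeff * sum(p.pow(2).sum() for p in params_to_prune)
```

The returned term is added to the task loss at every step; once the penalized weights have been driven near zero, removing them costs little accuracy.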