Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
- URL: http://arxiv.org/abs/2104.08682v1
- Date: Sun, 18 Apr 2021 02:20:37 GMT
- Title: Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm
- Authors: Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao
- Abstract summary: We show for the first time that sparse pruning compresses a BERT model significantly more than reducing its number of channels and layers.
Our method outperforms the leading competitors with a 20-times weight/FLOPs compression and negligible loss in prediction accuracy.
- Score: 5.621336109915588
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based pre-trained language models have significantly improved the
performance of various natural language processing (NLP) tasks in the recent
years. While effective and prevalent, these models are usually prohibitively
large for resource-limited deployment scenarios. A thread of research has thus
been working on applying network pruning techniques under the
pretrain-then-finetune paradigm widely adopted in NLP. However, the existing
pruning results on benchmark transformers, such as BERT, are not as remarkable
as the pruning results in the literature of convolutional neural networks
(CNNs). In particular, common wisdom in CNN pruning holds that sparse pruning
compresses a model more than reducing its number of channels and layers (Elsen
et al., 2020; Zhu and Gupta, 2017), whereas existing work on sparse pruning of
BERT yields results inferior to those of small-dense counterparts such as
TinyBERT (Jiao et al., 2020). In this work, we aim to fill this gap by studying
how knowledge is transferred and lost during the pre-train, fine-tune, and
pruning process, and by proposing a knowledge-aware sparse pruning process that
achieves significantly better results than the
existing literature. We show for the first time that sparse pruning compresses
a BERT model significantly more than reducing its number of channels and
layers. Experiments on multiple data sets of GLUE benchmark show that our
method outperforms the leading competitors with a 20-times weight/FLOPs
compression and negligible loss in prediction accuracy.
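The abstract names the ingredients but not the exact recipe, so the following is only a minimal sketch, assuming a standard combination: a gradual magnitude-pruning schedule in the style of Zhu and Gupta (2017) applied to a fine-tuned model, with knowledge distillation from the dense teacher so that task knowledge is retained while sparsity grows. The function names, schedule constants, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: gradual magnitude pruning with distillation from the dense
# fine-tuned teacher. Names, schedule, and weighting are assumptions.
import copy
import torch
import torch.nn.functional as F

def sparsity_at(step, total_steps, final_sparsity=0.95):
    # Cubic ramp toward final_sparsity (Zhu & Gupta, 2017 style schedule).
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def magnitude_masks(model, sparsity):
    # Keep the largest-magnitude entries of each weight matrix; zero the rest.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                      # skip biases / LayerNorm parameters
            continue
        k = int(p.numel() * sparsity)
        if k == 0:
            masks[name] = torch.ones_like(p)
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
    return masks

def prune_with_distillation(student, loader, total_steps, temperature=2.0, alpha=0.5):
    # `student` stands in for a fine-tuned BERT classifier; `loader` yields (inputs, labels).
    teacher = copy.deepcopy(student).eval()  # dense teacher keeps the task knowledge
    opt = torch.optim.AdamW(student.parameters(), lr=2e-5)
    for step, (x, y) in enumerate(loader):
        masks = magnitude_masks(student, sparsity_at(step, total_steps))
        with torch.no_grad():                # impose the current sparsity pattern
            for name, p in student.named_parameters():
                if name in masks:
                    p.mul_(masks[name])
        s_logits = student(x)
        with torch.no_grad():
            t_logits = teacher(x)
        kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                      F.softmax(t_logits / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = alpha * F.cross_entropy(s_logits, y) + (1.0 - alpha) * kd
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A full implementation would freeze the final masks and continue fine-tuning the surviving weights; here the mask is simply recomputed and reapplied at every step.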
Related papers
- YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training [9.02251811867533]
YOSO (You-Only-Sample-Once) is an algorithm designed to achieve efficient training while preserving prediction accuracy.
YOSO not only avoids costly computations in traditional compressed sensing (CS) methods, such as orthonormal basis calculations, but also ensures high-probability accuracy retention.
arXiv Detail & Related papers (2024-11-08T16:47:51Z)
- Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition [11.399520888150468]
We present a theoretically justified technique termed Low-Rank Induced Training (LoRITa).
LoRITa promotes low-rankness through the composition of linear layers and compresses via singular value truncation (a sketch of this idea appears after this list).
We demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks.
arXiv Detail & Related papers (2024-05-06T00:58:23Z)
- ThinResNet: A New Baseline for Structured Convolutional Networks Pruning [1.90298817989995]
Pruning is a compression method which aims to improve the efficiency of neural networks by reducing their number of parameters.
In this work, we verify how results in the recent literature of pruning hold up against networks that underwent both state-of-the-art training methods and trivial model scaling.
arXiv Detail & Related papers (2023-09-22T13:28:18Z)
- Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models [106.19385911520652]
The Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune large pre-trained models, generating sparse subnetworks.
LTH is, however, severely limited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP).
We propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks in a single pass.
arXiv Detail & Related papers (2023-06-18T03:09:52Z)
- Pruning Deep Neural Networks from a Sparsity Perspective [34.22967841734504]
Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance.
We propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm.
arXiv Detail & Related papers (2023-02-11T04:52:20Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove the least important channels.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on-and-off of each channel (a generic gate sketch appears after this list).
Experiments on the CIFAR-10 and ImageNet datasets show that the proposed GDP achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [79.78184026678659]
We study the effect of pruning throughout training from the perspective of pruning plasticity.
We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST); a prune-and-regrow sketch appears after this list.
Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet.
arXiv Detail & Related papers (2021-06-19T02:09:25Z)
- S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration [74.5509794733707]
We present a novel guided learning paradigm that distills binary networks from real-valued networks through the final prediction distribution.
Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.515% on BNNs.
Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods.
arXiv Detail & Related papers (2021-02-17T18:59:28Z)
- Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains (a minimal sketch appears after this list).
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
- Robust Pruning at Initialization [61.30574156442608]
There is a growing need for smaller, energy-efficient neural networks that bring machine learning applications to devices with limited computational resources.
For deep NNs, existing pruning-at-initialization procedures remain unsatisfactory: the resulting pruned networks can be difficult to train and, for instance, nothing prevents one layer from being fully pruned.
arXiv Detail & Related papers (2020-02-19T17:09:50Z)
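For the LoRITa entry above: a minimal sketch of the stated mechanism, low-rankness promoted by composing linear layers during training and compression by singular value truncation afterwards. The two-factor composition, the `compress` helper, and the chosen rank are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of low-rank induced training via linear-layer composition
# followed by singular value truncation; shapes and names are assumptions.
import torch
import torch.nn as nn

class ComposedLinear(nn.Module):
    # One d_in -> d_out layer overparameterized as two linear maps with no
    # nonlinearity in between; training such compositions (with weight decay)
    # tends to drive the effective product matrix toward low rank.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = nn.Linear(d_in, d_in, bias=False)
        self.b = nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.b(self.a(x))

    def compress(self, rank):
        # Collapse the composition, truncate its SVD, and keep two thin factors.
        with torch.no_grad():
            w = self.b.weight @ self.a.weight          # effective d_out x d_in matrix
            u, s, vh = torch.linalg.svd(w, full_matrices=False)
            first = nn.Linear(w.shape[1], rank, bias=False)   # d_in -> rank
            second = nn.Linear(rank, w.shape[0], bias=True)   # rank -> d_out
            first.weight.copy_(torch.diag(s[:rank]) @ vh[:rank, :])
            second.weight.copy_(u[:, :rank])
            second.bias.copy_(self.b.bias)
        return nn.Sequential(first, second)
```

After training, `compress(rank)` replaces the composition by a rank-`rank` factorization, which is where the parameter savings come from.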
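For the GDP entry above: a generic sketch of gate-based channel pruning, with a learnable per-channel gate in front of a convolution and a penalty that pushes gate values toward 0 or 1. The paper's exact differentiable-polarization function is not reproduced; `ChannelGate` and its penalty are stand-ins.

```python
# Hedged sketch of gate-based channel pruning; the polarization penalty here
# is a generic stand-in, not the GDP formulation.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Parameter(torch.ones(channels))    # one gate per channel

    def forward(self, x):                              # x: (N, C, H, W)
        return x * self.g.view(1, -1, 1, 1)

    def polarization_penalty(self):
        # Minimized when each gate sits at 0 (prune the channel) or 1 (keep it).
        g = self.g.abs()
        return (g * (1.0 - g)).abs().sum()

gate = ChannelGate(64)
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
x = torch.randn(2, 64, 32, 32)
out = conv(gate(x))                                    # gates scale channels on/off
loss = out.pow(2).mean() + 1e-3 * gate.polarization_penalty()   # placeholder task loss
loss.backward()
```

Channels whose gates end up near zero can then be removed together with the corresponding convolution filters.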
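For the GraNet entry above: a hedged sketch of a single prune-and-regrow step, dropping the smallest-magnitude active weights and reactivating the same number of inactive weights with the largest gradient magnitude. The paper's exact zero-cost regeneration criterion and schedule are not reproduced; `prune_and_regrow` is illustrative only.

```python
# Hedged sketch of one prune-and-regrow step; the regeneration rule is a
# generic gradient-magnitude criterion, not GraNet's exact procedure.
import torch

def prune_and_regrow(weight, grad, mask, prune_frac=0.1):
    # weight, grad, mask: tensors of identical shape; mask holds 0.0 / 1.0.
    active = mask.bool()
    n = int(active.sum().item() * prune_frac)
    if n == 0:
        return mask
    new_mask = mask.clone().flatten()
    # 1) Prune: drop the n smallest-magnitude active weights.
    mags = weight.abs().masked_fill(~active, float("inf")).flatten()
    new_mask[torch.topk(mags, n, largest=False).indices] = 0.0
    # 2) Regrow: reactivate the n inactive positions with the largest gradient.
    grads = grad.abs().flatten().masked_fill(new_mask.bool(), float("-inf"))
    new_mask[torch.topk(grads, n, largest=True).indices] = 1.0
    return new_mask.view_as(mask)
```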
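For the growing-regularization entry above: a minimal sketch of an L2 penalty whose coefficient rises over training for the weights slated for removal, pushing them smoothly toward zero before they are actually pruned. The selection of `params_to_prune`, the linear ramp, and `max_coeff` are assumptions.

```python
# Hedged sketch of a rising L2 penalty on weights slated for pruning;
# the selection rule and ramp are assumptions, not the paper's schedule.
import torch

def growing_l2_penalty(params_to_prune, step, ramp_steps, max_coeff=1e-2):
    # Penalty factor grows linearly from 0 to max_coeff over ramp_steps.
    coeff = max_coeff * min(step / ramp_steps, 1.0)
    return coeff * sum(p.pow(2).sum() for p in params_to_prune)
```

The returned term is added to the task loss at every step; once the penalized weights have been driven near zero, removing them costs little accuracy.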