MLPruning: A Multilevel Structured Pruning Framework for
Transformer-based Models
- URL: http://arxiv.org/abs/2105.14636v1
- Date: Sun, 30 May 2021 22:00:44 GMT
- Title: MLPruning: A Multilevel Structured Pruning Framework for
Transformer-based Models
- Authors: Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney
- Abstract summary: Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
- Score: 78.45898846056303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pruning is an effective method to reduce the memory footprint and
computational cost associated with large natural language processing models.
However, current approaches either only explore head pruning, which has a
limited pruning ratio, or only focus on unstructured pruning, which has
negligible effects on the real inference time and/or power consumption. To
address these challenges, we develop a novel MultiLevel structured Pruning
(MLPruning) framework, which uses three different levels of structured pruning:
head pruning, row pruning, and block-wise sparse pruning. We propose using a
learnable Top-k threshold, which employs an adaptive regularization to adjust
the regularization magnitude adaptively, to select appropriate pruning ratios
for different weight matrices. We also propose a two-step pipeline to combine
block-wise pruning with head/row pruning to achieve high structured pruning
ratios with minimum accuracy degradation. Our empirical results show that for
BERT-base, with approximately 20% of remaining weights, MLPruning can achieve an
accuracy that is comparable to the full model on QQP/MNLI/SQuAD, with up to
approximately 3.69x speedup. Our framework has been open sourced.
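
To make the block-wise level of pruning described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the paper learns its Top-k thresholds with an adaptive regularizer, whereas this sketch assumes a fixed `keep_ratio` and scores each block by its sum of squared weights. The function names `block_scores` and `prune_blocks` are hypothetical and chosen only for illustration.

```python
def block_scores(weights, block):
    """Score each (block x block) tile by its sum of squared entries.

    Assumption: magnitude-based scores stand in for the learned
    importance scores used in the paper.
    """
    rows, cols = len(weights), len(weights[0])
    assert rows % block == 0 and cols % block == 0, "matrix must tile evenly"
    scores = {}
    for bi in range(rows // block):
        for bj in range(cols // block):
            total = 0.0
            for i in range(bi * block, (bi + 1) * block):
                for j in range(bj * block, (bj + 1) * block):
                    total += weights[i][j] ** 2
            scores[(bi, bj)] = total
    return scores


def prune_blocks(weights, block, keep_ratio):
    """Keep the top `keep_ratio` fraction of blocks and zero out the rest.

    Returns the pruned matrix and the set of kept block coordinates.
    Assumption: `keep_ratio` is fixed here, not learned.
    """
    scores = block_scores(weights, block)
    n_keep = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = set(ranked[:n_keep])
    pruned = [
        [w if (i // block, j // block) in kept else 0.0
         for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]
    return pruned, kept
```

Because whole tiles are zeroed rather than scattered individual weights, the surviving structure can be mapped onto block-sparse kernels, which is the reason the abstract gives for preferring structured over unstructured pruning when real inference time matters.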
Related papers
- ADMM Based Semi-Structured Pattern Pruning Framework For Transformer [4.02487511510606]
This paper introduces an Alternating Direction Method of Multipliers (ADMM) based pattern pruning framework to reshape the distribution of the activation map.
We conduct extensive experiments on classification tasks over the GLUE dataset.
We achieve a 50% compression ratio while maintaining an overall score of 80.1 on the GLUE dataset.
arXiv Detail & Related papers (2024-07-11T09:35:08Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
Iterative magnitude pruning (IMP) is the predominant pruning method to successfully find 'winning tickets'.
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure.
arXiv Detail & Related papers (2022-10-08T19:19:29Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Data-Efficient Structured Pruning via Submodular Optimization [32.574190896543705]
We propose a data-efficient structured pruning method based on submodular optimization.
We show that this selection problem is a weakly submodular problem, thus it can be provably approximated using an efficient greedy algorithm.
Our method is one of the few in the literature that uses only a limited number of training data and no labels.
arXiv Detail & Related papers (2022-03-09T18:40:29Z)
- Dynamic Probabilistic Pruning: A general framework for
hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning [83.99191569112682]
Magnitude-based pruning is one of the simplest methods for pruning neural networks.
We develop a simple pruning method, coined lookahead pruning, by extending the single layer optimization to a multi-layer optimization.
Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks.
arXiv Detail & Related papers (2020-02-12T05:38:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.