MLPruning: A Multilevel Structured Pruning Framework for
Transformer-based Models
- URL: http://arxiv.org/abs/2105.14636v1
- Date: Sun, 30 May 2021 22:00:44 GMT
- Title: MLPruning: A Multilevel Structured Pruning Framework for
Transformer-based Models
- Authors: Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney
- Abstract summary: Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
- Score: 78.45898846056303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pruning is an effective method to reduce the memory footprint and
computational cost associated with large natural language processing models.
However, current approaches either only explore head pruning, which has a
limited pruning ratio, or only focus on unstructured pruning, which has
negligible effects on the real inference time and/or power consumption. To
address these challenges, we develop a novel MultiLevel structured Pruning
(MLPruning) framework, which uses three different levels of structured pruning:
head pruning, row pruning, and block-wise sparse pruning. We propose using a
learnable Top-k threshold, which employs an adaptive regularization to adjust
the regularization magnitude accordingly, to select appropriate pruning ratios
for different weight matrices. We also propose a two-step pipeline to combine
block-wise pruning with head/row pruning to achieve high structured pruning
ratios with minimum accuracy degradation. Our empirical results show that for
BERT-base, with approximately 20% of remaining weights, MLPruning can achieve
an accuracy comparable to the full model on QQP/MNLI/SQuAD, with up to
approximately 3.69x speedup. Our framework has been open sourced.
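
The abstract above describes the framework at a high level; the following is a minimal, illustrative sketch (in PyTorch-style Python) of its central mechanism, a learnable threshold over per-group importance scores that selects a pruning ratio for each weight matrix. It is not the authors' released code: the class name LearnableTopKThreshold, the straight-through relaxation, the temperature value, and the L2-norm row scores are all assumptions made for illustration.

import torch
import torch.nn as nn


class LearnableTopKThreshold(nn.Module):
    """Learns a per-matrix cut-off over normalized group-importance scores."""

    def __init__(self, init_logit: float = 0.0, temperature: float = 0.05):
        super().__init__()
        # Unconstrained parameter; sigmoid maps it to a cut-off in (0, 1).
        self.logit = nn.Parameter(torch.tensor(init_logit))
        self.temperature = temperature

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Normalize scores to [0, 1] so one threshold scale works for any matrix.
        s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
        t = torch.sigmoid(self.logit)
        hard = (s > t).float()                       # hard keep/drop decision per group
        soft = torch.sigmoid((s - t) / self.temperature)
        # Straight-through estimator: hard mask forward, smooth gradients backward.
        return hard + soft - soft.detach()


# Example: row pruning of a hypothetical feed-forward weight, rows scored by L2 norm.
W = torch.randn(3072, 768, requires_grad=True)
gate = LearnableTopKThreshold()
mask = gate(W.norm(dim=1))                           # one gate value per row
W_pruned = W * mask.unsqueeze(1)                     # zero out rows below the cut-off
keep_ratio = mask.mean()                             # regularize this to steer the pruning ratio

The same gating idea can be applied to attention heads (one score per head) or to tiles of a weight matrix (block-wise sparsity), which is how a single mechanism can cover all three pruning granularities mentioned in the abstract.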
Related papers
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in about 2.7 hours using around 35GB of memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
Iterative magnitude pruning (IMP) is the predominant pruning method for successfully finding 'winning tickets'.
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the pruning problem, as formulated by our bi-level optimization-oriented pruning method (termed BiP), is a special class of BLO problems with a bi-linear problem structure.
arXiv Detail & Related papers (2022-10-08T19:19:29Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Data-Efficient Structured Pruning via Submodular Optimization [32.574190896543705]
We propose a data-efficient structured pruning method based on submodular optimization.
We show that this selection problem is weakly submodular, so it can be provably approximated using an efficient greedy algorithm.
Our method is one of the few in the literature that uses only a limited number of training samples and no labels.
arXiv Detail & Related papers (2022-03-09T18:40:29Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework [40.635566748735386]
Most neural network pruning methods prune the network along only one dimension (depth, width, or resolution) to meet a computational budget.
We argue that pruning should be conducted along three dimensions comprehensively.
Our proposed algorithm surpasses state-of-the-art pruning algorithms and even neural architecture search-based algorithms.
arXiv Detail & Related papers (2020-10-10T02:30:47Z)
- Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning [83.99191569112682]
Magnitude-based pruning is one of the simplest methods for pruning neural networks.
We develop a simple pruning method, coined lookahead pruning, by extending the single-layer optimization to a multi-layer optimization.
Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning (a minimal sketch of that baseline follows this list) on various networks.
arXiv Detail & Related papers (2020-02-12T05:38:42Z)
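
As a point of reference for the lookahead entry above, the following is a minimal sketch of plain magnitude-based pruning, the baseline that lookahead pruning improves on; the function name magnitude_prune and the tensor shapes are illustrative assumptions, not code from any of the papers listed here.

import torch


def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that drops the smallest-magnitude fraction of weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


# Usage: remove 70% of a layer's weights by magnitude.
W = torch.randn(512, 512)
mask = magnitude_prune(W, sparsity=0.7)
W_sparse = W * mask

Roughly speaking, lookahead pruning replaces the per-weight score |w| with one that also accounts for the connected weights in neighboring layers, so each keep/drop decision considers more than a single layer.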