A Fast Post-Training Pruning Framework for Transformers
- URL: http://arxiv.org/abs/2204.09656v1
- Date: Tue, 29 Mar 2022 07:41:11 GMT
- Title: A Fast Post-Training Pruning Framework for Transformers
- Authors: Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt
Keutzer, Amir Gholami
- Abstract summary: Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
- Score: 74.59556951906468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pruning is an effective way to reduce the huge inference cost of large
Transformer models. However, prior work on model pruning requires retraining
the model. This can add high cost and complexity to model deployment, making it
difficult to use in many practical situations. To address this, we propose a
fast post-training pruning framework for Transformers that does not require any
retraining. Given a resource constraint and a sample dataset, our framework
automatically prunes the Transformer model using structured sparsity methods.
To retain high accuracy without retraining, we introduce three novel
techniques: (i) a lightweight mask search algorithm that finds which heads and
filters to prune based on the Fisher information; (ii) mask rearrangement that
complements the search algorithm; and (iii) mask tuning that reconstructs the
output activations for each layer. We apply our method to BERT-BASE and
DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our
framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference
latency, while maintaining < 1% loss in accuracy. Importantly, our framework
prunes Transformers in less than 3 minutes on a single GPU, which is over two
orders of magnitude faster than existing pruning approaches that retrain. Our
code is publicly available.
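The mask-search step in the abstract is concrete enough to illustrate. The snippet below is not the authors' released implementation (the abstract only notes that code is publicly available); it is a minimal sketch, assuming one mask scalar per attention head, a stand-in loss function, and uniform per-head FLOPs, of how diagonal-Fisher importance can be accumulated from squared mask gradients and how heads can then be kept greedily under a FLOPs budget.

```python
import torch

def accumulate_fisher(loss_fn, mask, batches):
    """Approximate per-unit importance by the diagonal Fisher information:
    the squared gradient of the loss w.r.t. each mask variable, summed over
    a small sample set."""
    fisher = torch.zeros_like(mask)
    for batch in batches:
        loss = loss_fn(batch, mask)                # forward pass with the mask applied
        (grad,) = torch.autograd.grad(loss, mask)
        fisher += grad.detach() ** 2
    return fisher

def select_units(fisher, unit_flops, flops_budget):
    """Greedily keep the highest-importance units that fit in the FLOPs budget."""
    keep = torch.zeros_like(fisher, dtype=torch.bool)
    spent = 0.0
    for idx in torch.argsort(fisher, descending=True):
        if spent + unit_flops[idx] <= flops_budget:
            keep[idx] = True
            spent = spent + unit_flops[idx]
    return keep

# Toy usage: 12 attention heads, a quadratic stand-in loss, and a 50% FLOPs budget.
torch.manual_seed(0)
head_weights = torch.rand(12)                      # hypothetical "true" head relevance
mask = torch.ones(12, requires_grad=True)          # one mask variable per head

def toy_loss(batch, m):
    return ((m * head_weights * batch).sum() - batch.sum()) ** 2

batches = [torch.rand(12) for _ in range(8)]
fisher = accumulate_fisher(toy_loss, mask, batches)
keep = select_units(fisher, unit_flops=torch.ones(12), flops_budget=6.0)
print("heads kept:", keep.nonzero().flatten().tolist())
```

The paper's remaining two steps, mask rearrangement and per-layer mask tuning, then adjust the surviving masks so that each layer's output activations are approximately reconstructed; they are omitted from this sketch.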
Related papers
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU.
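As a rough illustration of learning masks in a probabilistic space with only forward evaluations of the pruned model, here is a toy REINFORCE-style sketch; it is not the paper's implementation, and the surrogate pruned_loss, the 12-unit mask, and the running baseline are all made-up stand-ins.

```python
import torch

def pruned_loss(mask, weights, target):
    """Forward-only surrogate for evaluating the pruned model; no gradient
    ever flows through the "model" weights."""
    with torch.no_grad():
        return ((mask * weights).sum() - target) ** 2

torch.manual_seed(0)
weights = torch.rand(12)
target = weights[:6].sum()                 # in this toy, only the first 6 units matter
logits = torch.zeros(12, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)
baseline = 0.0

for step in range(200):
    probs = torch.sigmoid(logits)
    dist = torch.distributions.Bernoulli(probs=probs)
    mask = dist.sample()                       # sample a binary keep/prune mask
    loss = pruned_loss(mask, weights, target)  # forward pass of the "pruned model" only
    # Score-function (REINFORCE) estimator: log-prob of the sampled mask,
    # weighted by the advantage (loss minus a running baseline).
    surrogate = dist.log_prob(mask).sum() * (loss - baseline)
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
    baseline = 0.9 * baseline + 0.1 * loss.item()

print("learned keep-probabilities:", torch.sigmoid(logits).detach().tolist())
```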
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- STAT: Shrinking Transformers After Training [72.0726371426711]
We present STAT, a simple algorithm to prune transformer models without any fine-tuning.
STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer.
Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
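The "correction to the weights of the next layer" admits a simple least-squares reading: after dropping neurons, refit the remaining next-layer weights so that the next layer's output on a small calibration set is preserved. The sketch below illustrates that reading with random stand-in activations; it is not STAT's actual algorithm.

```python
import torch

def prune_with_correction(H, W, keep):
    """Drop neurons (columns of the activations H / rows of the next-layer
    weight W) and refit the remaining rows of W by least squares so that
    H @ W is approximately preserved on the calibration activations."""
    target = H @ W                                        # original next-layer inputs
    W_corrected = torch.linalg.lstsq(H[:, keep], target).solution
    return W_corrected

torch.manual_seed(0)
H = torch.randn(256, 64)            # calibration activations: (samples, neurons)
W = torch.randn(64, 32)             # weight of the next layer
keep = torch.arange(64) % 2 == 0    # toy choice: keep every other neuron
W_small = prune_with_correction(H, W, keep)

rel_err = (H @ W - H[:, keep] @ W_small).norm() / (H @ W).norm()
print(f"relative output error after pruning half the neurons: {rel_err.item():.3f}")
```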
arXiv Detail & Related papers (2024-05-29T22:59:11Z)
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly because of the statistics it must compute during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training.
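One plausible reading of "progressively replace LayerNorm with re-parameterized BatchNorm" is a blended normalization whose mixing weight is annealed toward BatchNorm over training; the module below sketches only that reading, and the name ProgressiveNorm, the linear schedule, and the shapes are assumptions rather than the paper's definitions.

```python
import torch
import torch.nn as nn

class ProgressiveNorm(nn.Module):
    """Blend LayerNorm and BatchNorm: y = lam * LN(x) + (1 - lam) * BN(x).
    lam is annealed from 1 to 0 during training, so inference can use
    BatchNorm only (which, unlike LayerNorm, folds into adjacent linear layers)."""
    def __init__(self, dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.bn = nn.BatchNorm1d(dim)
        self.register_buffer("lam", torch.tensor(1.0))

    def set_progress(self, t):          # t goes from 0 (start) to 1 (end of training)
        self.lam.fill_(max(0.0, 1.0 - t))

    def forward(self, x):               # x: (batch, seq_len, dim)
        bn_out = self.bn(x.transpose(1, 2)).transpose(1, 2)  # BN over the feature dim
        return self.lam * self.ln(x) + (1.0 - self.lam) * bn_out

# Toy usage: anneal the blend across 10 "epochs".
norm = ProgressiveNorm(dim=16)
x = torch.randn(4, 8, 16)
for epoch in range(10):
    norm.set_progress(epoch / 9)
    _ = norm(x)
print("final LayerNorm mixing weight:", float(norm.lam))
```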
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
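To make the three granularities concrete, the sketch below applies head-level, row-level, and block-level masks to randomly initialized weights with made-up shapes; the masks here are random stand-ins rather than the learned masks the framework actually produces.

```python
import torch

# Hypothetical shapes for a small transformer layer (not the paper's code):
# 12 heads of size 64, an FFN weight of shape (3072, 768), and 64x64 blocks.
torch.manual_seed(0)
num_heads, head_dim, hidden, ffn = 12, 64, 768, 3072
attn_out_w = torch.randn(hidden, num_heads * head_dim)
ffn_w = torch.randn(ffn, hidden)

# Level 1: head pruning - zero whole heads in the attention output projection.
head_mask = (torch.rand(num_heads) > 0.25).float()            # keep ~75% of heads
attn_pruned = attn_out_w * head_mask.repeat_interleave(head_dim)

# Level 2: row pruning - zero whole rows (intermediate neurons) of the FFN weight.
row_mask = (torch.rand(ffn) > 0.25).float()
ffn_row_pruned = ffn_w * row_mask[:, None]

# Level 3: block-wise sparse pruning - zero the lowest-magnitude 64x64 blocks.
bh, bw = 64, 64
blocks = ffn_row_pruned.reshape(ffn // bh, bh, hidden // bw, bw)
block_norms = torch.linalg.norm(blocks, dim=(1, 3))            # one score per block
block_mask = (block_norms >= block_norms.median()).float()     # keep the top half
ffn_block_pruned = (blocks * block_mask[:, None, :, None]).reshape(ffn, hidden)

print("FFN density after three levels:", float((ffn_block_pruned != 0).float().mean()))
```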
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
- Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition [11.6409723227448]
Transformer-based models have achieved state-of-the-art results on many natural language processing tasks.
We develop an efficient algorithm to search for fast models while maintaining model quality.
arXiv Detail & Related papers (2020-08-15T23:12:25Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
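A cascade of rankers can be sketched generically: each stage scores the surviving candidates and drops the lowest-scored fraction before a stronger, costlier stage runs. The code below uses hypothetical stand-in scorers; in the paper the stages are partial and full transformer encodings rather than the toy functions here.

```python
import torch

def cascade_rank(candidates, rankers, keep_fracs):
    """Score candidates with progressively stronger (and costlier) rankers,
    dropping the lowest-scored fraction after each stage so later stages
    only see the survivors."""
    for ranker, keep_frac in zip(rankers, keep_fracs):
        scores = ranker(candidates)
        k = max(1, int(len(candidates) * keep_frac))
        top = torch.topk(scores, k).indices
        candidates = [candidates[i] for i in top.tolist()]
    return candidates

# Toy usage with two stand-in scorers of increasing fidelity.
torch.manual_seed(0)
true_quality = torch.rand(100)

def cheap_ranker(cands):                 # noisy but fast first-stage scores
    return true_quality[torch.tensor(cands)] + 0.3 * torch.randn(len(cands))

def strong_ranker(cands):                # accurate but expensive second stage
    return true_quality[torch.tensor(cands)]

survivors = cascade_rank(list(range(100)), [cheap_ranker, strong_ranker], [0.3, 0.2])
print("answer candidates kept by the cascade:", sorted(survivors))
```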
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
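The early-exit idea in the entry above follows a standard pattern: attach a classifier after each layer and stop at the first one whose confidence clears a threshold. The toy module below shows that generic pattern with plain linear layers; it is not the paper's released code, which calibrates exits on a pretrained transformer.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """A stack of layers with a classifier ("exit head") after each one.
    At inference, stop at the first head whose softmax confidence clears a
    threshold, so easy inputs use fewer layers."""
    def __init__(self, dim=32, num_layers=4, num_classes=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_layers))

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits), start=1):
            x = torch.relu(layer(x))
            probs = torch.softmax(exit_head(x), dim=-1)
            confidence, pred = probs.max(dim=-1)
            if confidence.item() >= threshold:      # confident enough: exit early
                return pred.item(), depth
        return pred.item(), depth                   # fell through: used all layers

model = EarlyExitClassifier()
pred, depth = model(torch.randn(32))                # single example for simplicity
print(f"predicted class {pred} after {depth} of 4 layers")
```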
This list is automatically generated from the titles and abstracts of the papers on this site.