Layer-wise Pruning of Transformer Attention Heads for Efficient Language
Modeling
- URL: http://arxiv.org/abs/2110.03252v1
- Date: Thu, 7 Oct 2021 08:19:26 GMT
- Title: Layer-wise Pruning of Transformer Attention Heads for Efficient Language
Modeling
- Authors: Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi
- Abstract summary: Attention head pruning is a promising technique for reducing the large computation cost of Transformer-based language models.
We propose three training methods that are especially helpful for minimizing performance degradation.
Our pruned model shows consistently lower perplexity than Transformer-XL at a comparable parameter count on the WikiText-103 language modeling benchmark.
- Score: 22.278610066038954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Transformer-based models have shown impressive language modeling
performance, the large computation cost is often prohibitive for practical use.
Attention head pruning, which removes unnecessary attention heads in the
multi-head attention, is a promising technique to solve this problem. However,
it does not evenly reduce the overall load because the heavy feed-forward module
is not affected by head pruning. In this paper, we apply layer-wise attention
head pruning on All-attention Transformer so that the entire computation and
the number of parameters can be reduced proportionally to the number of pruned
heads. While the architecture has the potential to fully utilize head pruning,
pruning can still degrade quality and destabilize training, so we propose three
training methods that are especially helpful for minimizing performance
degradation and stabilizing the pruning process. Our pruned model consistently
achieves lower perplexity than Transformer-XL at a comparable parameter count
on the WikiText-103 language modeling benchmark.
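To make the proportionality concrete, here is a minimal parameter-count sketch
in Python; the layer sizes (d_model, d_ff, the number of persistent memory
slots) are illustrative assumptions, not the paper's configuration. It contrasts
a standard Transformer block, whose feed-forward module survives head pruning,
with an All-attention-style block in which every parameter belongs to some head.

    # Parameter counts under head pruning (illustrative sizes, not the paper's).
    def standard_block_params(d_model, n_heads, kept_heads, d_ff):
        d_head = d_model // n_heads
        attn = 4 * d_model * d_head * kept_heads      # Q, K, V, output projections
        ffn = 2 * d_model * d_ff                      # untouched by head pruning
        return attn + ffn

    def all_attention_block_params(d_model, n_heads, kept_heads, n_persistent):
        d_head = d_model // n_heads
        attn = 4 * d_model * d_head * kept_heads      # Q, K, V, output projections
        mem = 2 * n_persistent * d_head * kept_heads  # per-head persistent key/value slots
        return attn + mem                             # no separate feed-forward module

    for kept in (16, 12, 8):
        std = standard_block_params(1024, 16, kept, d_ff=4096)
        alla = all_attention_block_params(1024, 16, kept, n_persistent=256)
        print(f"kept {kept}/16 heads: standard {std / 1e6:.1f}M, all-attention {alla / 1e6:.1f}M")

With half of the heads removed, the standard block keeps most of its parameters
because of the untouched feed-forward module, whereas the All-attention block
shrinks roughly in half.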
Related papers
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention in the token-generation (decode) phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
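As a rough sketch of the post-training setting (the importance score below is
an assumed magnitude heuristic, not the framework's actual mask-search
procedure), heads can be masked directly from activations gathered on a small
calibration set, with no retraining step:

    import torch

    @torch.no_grad()
    def prune_heads_post_training(head_outputs: torch.Tensor, n_prune: int) -> torch.Tensor:
        # head_outputs: (batch, n_heads, seq_len, d_head) collected on calibration data.
        scores = head_outputs.norm(dim=-1).mean(dim=(0, 2))      # one score per head
        mask = torch.ones_like(scores)
        mask[scores.topk(n_prune, largest=False).indices] = 0.0  # drop the weakest heads
        return head_outputs * mask.view(1, -1, 1, 1)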
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
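A simplified sketch of gated head pruning with an exact head budget is shown
below; the straight-through top-k gate is an assumed relaxation for
illustration, not the authors' Gumbel-based subset sampler.

    import torch
    import torch.nn as nn

    class TopKHeadGate(nn.Module):
        # Keeps exactly k_keep heads while letting gradients reach every head's logit.
        def __init__(self, n_heads: int, k_keep: int):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(n_heads))
            self.k_keep = k_keep

        def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
            # head_outputs: (batch, n_heads, seq_len, d_head)
            soft = torch.sigmoid(self.logits)            # differentiable head scores
            hard = torch.zeros_like(soft)
            hard[soft.topk(self.k_keep).indices] = 1.0   # exactly k_keep heads survive
            gates = hard + soft - soft.detach()          # straight-through estimator
            return head_outputs * gates.view(1, -1, 1, 1)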
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
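The three granularities can be pictured on a single weight matrix, as in the
sketch below; the matrix shape, the pruned head indices, and the keep
thresholds are illustrative assumptions, not the MLPruning recipe.

    import torch

    d_model, n_heads, block = 512, 8, 32
    d_head = d_model // n_heads
    W = torch.randn(d_model, d_model)

    # 1) Head pruning: zero the columns belonging to two pruned heads.
    head_mask = torch.ones(n_heads)
    head_mask[[5, 7]] = 0.0
    W = W * head_mask.repeat_interleave(d_head)[None, :]

    # 2) Row pruning: drop the quarter of rows with the smallest L2 norm.
    row_norms = W.norm(dim=1)
    W = W * (row_norms > row_norms.quantile(0.25)).float()[:, None]

    # 3) Block-wise sparsity: zero entire (block x block) tiles with small norm.
    tiles = W.reshape(d_model // block, block, d_model // block, block)
    tile_norms = tiles.pow(2).sum(dim=(1, 3)).sqrt()
    keep = (tile_norms > tile_norms.quantile(0.5)).float()
    W = (tiles * keep[:, None, :, None]).reshape(d_model, d_model)

    print(f"remaining density: {(W != 0).float().mean().item():.2f}")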
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
- Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads [114.77890059625162]
We propose a method, called Single-Shot Meta-Pruning, to compress deep pre-trained Transformers before fine-tuning.
We focus on pruning unnecessary attention heads adaptively for different downstream tasks.
Compared with existing compression methods for pre-trained models, our method can reduce the overhead of both fine-tuning and inference.
arXiv Detail & Related papers (2020-11-07T12:58:37Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
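The bottleneck is easy to check numerically; the sizes below are chosen purely
for illustration and are not tied to any particular model.

    import torch

    n, d_model, n_heads = 512, 512, 8
    d_head = d_model // n_heads          # 64, shrinks as heads are added

    Q = torch.randn(n, d_head)
    K = torch.randn(n, d_head)
    scores = Q @ K.T                     # one head's (n x n) attention logits
    print(torch.linalg.matrix_rank(scores).item())   # at most d_head = 64, not n = 512

Fixing the head size at the sequence length (here 512) instead of d_model / n_heads
removes this rank cap regardless of how many heads are used.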
arXiv Detail & Related papers (2020-02-17T16:16:40Z)