Layer-wise Pruning of Transformer Attention Heads for Efficient Language
Modeling
- URL: http://arxiv.org/abs/2110.03252v1
- Date: Thu, 7 Oct 2021 08:19:26 GMT
- Title: Layer-wise Pruning of Transformer Attention Heads for Efficient Language
Modeling
- Authors: Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi
- Abstract summary: Attention head pruning is a promising technique for reducing the large computation cost of Transformer-based language models.
We propose three training methods that are especially helpful for minimizing performance degradation.
Our pruned model shows consistently lower perplexity than Transformer-XL at a comparable parameter count on the WikiText-103 language modeling benchmark.
- Score: 22.278610066038954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Transformer-based models have shown impressive language modeling
performance, the large computation cost is often prohibitive for practical use.
Attention head pruning, which removes unnecessary attention heads in the
multi-head attention, is a promising technique to solve this problem. However,
it does not evenly reduce the overall load because the heavy feed-forward module
is not affected by head pruning. In this paper, we apply layer-wise attention
head pruning on All-attention Transformer so that the entire computation and
the number of parameters can be reduced proportionally to the number of pruned
heads. While the architecture has the potential to fully utilize head pruning,
pruning can still degrade quality and destabilize training, so we propose three
training methods that are especially helpful for minimizing performance
degradation and stabilizing the pruning process. Our pruned model consistently
achieves lower perplexity than Transformer-XL at a comparable parameter count
on the WikiText-103 language modeling benchmark.
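To make the proportionality concrete, here is a minimal parameter-count sketch
in Python; the layer sizes (d_model, d_ff, the number of persistent memory
slots) are illustrative assumptions, not the paper's configuration. It contrasts
a standard Transformer block, whose feed-forward module survives head pruning,
with an All-attention-style block in which every parameter belongs to some head.

    # Parameter counts under head pruning (illustrative sizes, not the paper's).
    def standard_block_params(d_model, n_heads, kept_heads, d_ff):
        d_head = d_model // n_heads
        attn = 4 * d_model * d_head * kept_heads      # Q, K, V, output projections
        ffn = 2 * d_model * d_ff                      # untouched by head pruning
        return attn + ffn

    def all_attention_block_params(d_model, n_heads, kept_heads, n_persistent):
        d_head = d_model // n_heads
        attn = 4 * d_model * d_head * kept_heads      # Q, K, V, output projections
        mem = 2 * n_persistent * d_head * kept_heads  # per-head persistent key/value slots
        return attn + mem                             # no separate feed-forward module

    for kept in (16, 12, 8):
        std = standard_block_params(1024, 16, kept, d_ff=4096)
        alla = all_attention_block_params(1024, 16, kept, n_persistent=256)
        print(f"kept {kept}/16 heads: standard {std / 1e6:.1f}M, all-attention {alla / 1e6:.1f}M")

With half of the heads removed, the standard block keeps most of its parameters
because of the untouched feed-forward module, whereas the All-attention block
shrinks roughly in half.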
Related papers
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention in the token-generation (decode) phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
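As a rough sketch of the post-training setting (the importance score below is
an assumed magnitude heuristic, not the framework's actual mask-search
procedure), heads can be masked directly from activations gathered on a small
calibration set, with no retraining step:

    import torch

    @torch.no_grad()
    def prune_heads_post_training(head_outputs: torch.Tensor, n_prune: int) -> torch.Tensor:
        # head_outputs: (batch, n_heads, seq_len, d_head) collected on calibration data.
        scores = head_outputs.norm(dim=-1).mean(dim=(0, 2))      # one score per head
        mask = torch.ones_like(scores)
        mask[scores.topk(n_prune, largest=False).indices] = 0.0  # drop the weakest heads
        return head_outputs * mask.view(1, -1, 1, 1)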
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
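A simplified sketch of gated head pruning with an exact head budget is shown
below; the straight-through top-k gate is an assumed relaxation for
illustration, not the authors' Gumbel-based subset sampler.

    import torch
    import torch.nn as nn

    class TopKHeadGate(nn.Module):
        # Keeps exactly k_keep heads while letting gradients reach every head's logit.
        def __init__(self, n_heads: int, k_keep: int):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(n_heads))
            self.k_keep = k_keep

        def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
            # head_outputs: (batch, n_heads, seq_len, d_head)
            soft = torch.sigmoid(self.logits)            # differentiable head scores
            hard = torch.zeros_like(soft)
            hard[soft.topk(self.k_keep).indices] = 1.0   # exactly k_keep heads survive
            gates = hard + soft - soft.detach()          # straight-through estimator
            return head_outputs * gates.view(1, -1, 1, 1)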
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
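The three granularities can be pictured on a single weight matrix, as in the
sketch below; the matrix shape, the pruned head indices, and the keep
thresholds are illustrative assumptions, not the MLPruning recipe.

    import torch

    d_model, n_heads, block = 512, 8, 32
    d_head = d_model // n_heads
    W = torch.randn(d_model, d_model)

    # 1) Head pruning: zero the columns belonging to two pruned heads.
    head_mask = torch.ones(n_heads)
    head_mask[[5, 7]] = 0.0
    W = W * head_mask.repeat_interleave(d_head)[None, :]

    # 2) Row pruning: drop the quarter of rows with the smallest L2 norm.
    row_norms = W.norm(dim=1)
    W = W * (row_norms > row_norms.quantile(0.25)).float()[:, None]

    # 3) Block-wise sparsity: zero entire (block x block) tiles with small norm.
    tiles = W.reshape(d_model // block, block, d_model // block, block)
    tile_norms = tiles.pow(2).sum(dim=(1, 3)).sqrt()
    keep = (tile_norms > tile_norms.quantile(0.5)).float()
    W = (tiles * keep[:, None, :, None]).reshape(d_model, d_model)

    print(f"remaining density: {(W != 0).float().mean().item():.2f}")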
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
- Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads [114.77890059625162]
We propose a method, called Single-Shot Meta-Pruning, to compress deep pre-trained Transformers before fine-tuning.
We focus on pruning unnecessary attention heads adaptively for different downstream tasks.
Compared with existing compression methods for pre-trained models, our method can reduce the overhead of both fine-tuning and inference.
arXiv Detail & Related papers (2020-11-07T12:58:37Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
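The bottleneck is easy to check numerically; the sizes below are chosen purely
for illustration and are not tied to any particular model.

    import torch

    n, d_model, n_heads = 512, 512, 8
    d_head = d_model // n_heads          # 64, shrinks as heads are added

    Q = torch.randn(n, d_head)
    K = torch.randn(n, d_head)
    scores = Q @ K.T                     # one head's (n x n) attention logits
    print(torch.linalg.matrix_rank(scores).item())   # at most d_head = 64, not n = 512

Fixing the head size at the sequence length (here 512) instead of d_model / n_heads
removes this rank cap regardless of how many heads are used.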
arXiv Detail & Related papers (2020-02-17T16:16:40Z)