Low-Rank Bottleneck in Multi-head Attention Models
- URL: http://arxiv.org/abs/2002.07028v1
- Date: Mon, 17 Feb 2020 16:16:40 GMT
- Title: Low-Rank Bottleneck in Multi-head Attention Models
- Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J.
Reddi, Sanjiv Kumar
- Abstract summary: We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
- Score: 74.83235382203604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention-based Transformer architecture has enabled significant advances in
the field of natural language processing. In addition to new pre-training
techniques, recent improvements crucially rely on working with a relatively
larger embedding dimension for tokens. Unfortunately, this leads to models that
are prohibitively large to be employed in downstream tasks. In this paper
we identify one of the important factors contributing to the large embedding
size requirement. In particular, our analysis highlights that the scaling
between the number of heads and the size of each head in the current
architecture gives rise to a low-rank bottleneck in attention heads, causing
this limitation. We further validate this in our experiments. As a solution we
propose to set the head size of an attention unit to the input sequence length,
independent of the number of heads, resulting in multi-head attention layers
with provably more expressive power. We empirically show that this allows us to
train models with a relatively smaller embedding dimension and with better
performance scaling.
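To make the bottleneck concrete: in standard multi-head attention each head projects into d_head = d_model / num_heads dimensions, so a head's n x n matrix of attention logits has rank at most d_head, which can be far below the sequence length n. The minimal NumPy sketch below (our own illustration with toy dimensions, not the authors' code) checks this rank numerically and contrasts it with the paper's proposal of setting the head size to the sequence length, independent of the number of heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_score_rank(d_model, d_head, n):
    """Rank of the n x n attention-logit matrix Q K^T for a single head.

    Q = X Wq and K = X Wk with Wq, Wk of shape (d_model, d_head), so
    rank(Q K^T) <= min(n, d_head, d_model).
    """
    X = rng.standard_normal((n, d_model))          # token representations
    Wq = rng.standard_normal((d_model, d_head))
    Wk = rng.standard_normal((d_model, d_head))
    logits = (X @ Wq) @ (X @ Wk).T                 # n x n attention logits
    return np.linalg.matrix_rank(logits)

n, d_model, num_heads = 128, 64, 8

# Standard parameterization: d_head = d_model / num_heads = 8 << n.
print(head_score_rank(d_model, d_model // num_heads, n))   # -> 8 (low-rank bottleneck)

# Proposed fix: d_head = n, independent of num_heads.
print(head_score_rank(d_model, n, n))                      # -> 64 (limited only by d_model)
```

With d_head = n the measured rank is limited only by d_model itself rather than by d_model / num_heads, which is what allows training with a smaller embedding dimension without the extra rank restriction.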
Related papers
- On the Benefits of Rank in Attention Layers [38.651863218241154]
We show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism.
We present experiments with off-the-shelf transformers that validate our findings.
arXiv Detail & Related papers (2024-07-23T03:40:24Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the dot-product pairwise-matching attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
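The summary above does not spell out SimpleTron's replacement for softmax attention, so the snippet below is only a generic illustration of why dropping the softmax is attractive: without the row-wise normalization, the attention product can be re-associated so the n x n score matrix is never formed. All names and dimensions here are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 32                       # sequence length, head dimension
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: the softmax forces the n x n score matrix to be
# materialized, costing O(n^2 d).
out_softmax = softmax(Q @ K.T / np.sqrt(d)) @ V

# Without the softmax, the product can be re-associated as Q (K^T V),
# which never forms the n x n matrix and costs O(n d^2) instead.
out_linear = Q @ (K.T @ V)
assert np.allclose((Q @ K.T) @ V, out_linear)
print(out_softmax.shape, out_linear.shape)   # both (256, 32)
```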
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling [22.278610066038954]
Attention head pruning is a promising technique for making Transformer language models more efficient.
We propose three training methods that are especially helpful to minimize performance degradation.
Our pruned model consistently achieves lower perplexity than Transformer-XL at a comparable parameter size on the WikiText-103 language modeling benchmark.
arXiv Detail & Related papers (2021-10-07T08:19:26Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Using multiple parallel attention heads improves the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function that prevents the model from assigning the same score to all heads, helping it identify the more important heads and improve performance.
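The exact form of DHICM's extra loss is not given in the summary above; the toy penalty below merely illustrates the stated goal, scoring how close the per-head importance distribution is to uniform so that training can push the model away from giving every head the same score. The function name and normalization are our own.

```python
import numpy as np

def uniformity_penalty(head_scores):
    """Toy auxiliary penalty: large when all heads get (near-)equal importance.

    `head_scores` are non-negative per-head importance weights. We normalize
    them and measure their entropy, which is maximal for a uniform
    distribution. The actual DHICM loss may differ; this only illustrates the
    idea of discouraging identical scores across heads.
    """
    p = np.asarray(head_scores, dtype=float)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy / np.log(len(p))       # in [0, 1]; 1.0 == perfectly uniform

print(uniformity_penalty([0.25, 0.25, 0.25, 0.25]))   # ~1.0 (penalized most)
print(uniformity_penalty([0.85, 0.05, 0.05, 0.05]))   # ~0.42 (penalized less)
```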
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
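As a rough sketch of the shared-projection idea (our own parameterization with hypothetical dimensions, not necessarily the paper's exact formulation): all heads reuse one pair of key/query projections and differ only in a small per-head mixing vector over the shared dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_model, shared_dim, num_heads = 16, 64, 32, 8

X = rng.standard_normal((n, d_model))

# All heads share one pair of query/key projections ...
Wq = rng.standard_normal((d_model, shared_dim))
Wk = rng.standard_normal((d_model, shared_dim))
# ... and each head only learns a small mixing vector over the shared
# key/query dimensions (illustrative parameterization).
mix = rng.standard_normal((num_heads, shared_dim))

Q, K = X @ Wq, X @ Wk                            # computed once, reused by all heads
scores = np.einsum('nd,hd,md->hnm', Q, mix, K)   # per-head n x n attention logits

# Query/key parameters: 2*d_model*shared_dim + num_heads*shared_dim = 4352,
# versus 2*d_model*d_model = 8192 for standard per-head concatenated projections.
print(scores.shape)   # (8, 16, 16)
```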
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
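To illustrate what a fixed, non-learnable attentive pattern can look like, the sketch below builds hard positional patterns (previous/current/next token). These are examples of the kind of pattern such heads could use, not necessarily the exact set from the paper; the function name and toy values are our own.

```python
import numpy as np

def fixed_pattern(n, offset):
    """Fixed, non-learnable attention matrix: token i attends to token i+offset."""
    A = np.zeros((n, n))
    for i in range(n):
        j = min(max(i + offset, 0), n - 1)   # clamp at sequence boundaries
        A[i, j] = 1.0
    return A

n = 6
prev_head = fixed_pattern(n, -1)   # attends to the previous token
curr_head = fixed_pattern(n, 0)    # attends to the token itself
next_head = fixed_pattern(n, +1)   # attends to the next token

V = np.arange(n * 4, dtype=float).reshape(n, 4)   # toy per-token value vectors
print(prev_head @ V)   # each row is simply the previous token's value vector
```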
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.