Low-Rank Bottleneck in Multi-head Attention Models
- URL: http://arxiv.org/abs/2002.07028v1
- Date: Mon, 17 Feb 2020 16:16:40 GMT
- Title: Low-Rank Bottleneck in Multi-head Attention Models
- Authors: Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J.
Reddi, Sanjiv Kumar
- Abstract summary: We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
- Score: 74.83235382203604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention-based Transformer architecture has enabled significant advances in
the field of natural language processing. In addition to new pre-training
techniques, recent improvements crucially rely on working with a relatively
larger embedding dimension for tokens. Unfortunately, this leads to models that
are prohibitively large to be employed in downstream tasks. In this paper
we identify one of the important factors contributing to the large embedding
size requirement. In particular, our analysis highlights that the scaling
between the number of heads and the size of each head in the current
architecture gives rise to a low-rank bottleneck in attention heads, causing
this limitation. We further validate this in our experiments. As a solution we
propose to set the head size of an attention unit to the input sequence length,
independent of the number of heads, resulting in multi-head attention layers
with provably more expressive power. We empirically show that this allows us to
train models with a relatively smaller embedding dimension and with better
performance scaling.
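To make the bottleneck concrete: in standard multi-head attention each head projects into d_head = d_model / num_heads dimensions, so a head's n x n matrix of attention logits has rank at most d_head, which can be far below the sequence length n. The minimal NumPy sketch below (our own illustration with toy dimensions, not the authors' code) checks this rank numerically and contrasts it with the paper's proposal of setting the head size to the sequence length, independent of the number of heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_score_rank(d_model, d_head, n):
    """Rank of the n x n attention-logit matrix Q K^T for a single head.

    Q = X Wq and K = X Wk with Wq, Wk of shape (d_model, d_head), so
    rank(Q K^T) <= min(n, d_head, d_model).
    """
    X = rng.standard_normal((n, d_model))          # token representations
    Wq = rng.standard_normal((d_model, d_head))
    Wk = rng.standard_normal((d_model, d_head))
    logits = (X @ Wq) @ (X @ Wk).T                 # n x n attention logits
    return np.linalg.matrix_rank(logits)

n, d_model, num_heads = 128, 64, 8

# Standard parameterization: d_head = d_model / num_heads = 8 << n.
print(head_score_rank(d_model, d_model // num_heads, n))   # -> 8 (low-rank bottleneck)

# Proposed fix: d_head = n, independent of num_heads.
print(head_score_rank(d_model, n, n))                      # -> 64 (limited only by d_model)
```

With d_head = n the measured rank is limited only by d_model itself rather than by d_model / num_heads, which is what allows training with a smaller embedding dimension without the extra rank restriction.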
Related papers
- On the Benefits of Rank in Attention Layers [38.651863218241154]
We show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism.
We present experiments with off-the-shelf transformers that validate our findings.
arXiv Detail & Related papers (2024-07-23T03:40:24Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the dot-product pairwise-matching attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
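The summary above does not spell out SimpleTron's replacement for softmax attention, so the snippet below is only a generic illustration of why dropping the softmax is attractive: without the row-wise normalization, the attention product can be re-associated so the n x n score matrix is never formed. All names and dimensions here are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 32                       # sequence length, head dimension
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Standard attention: the softmax forces the n x n score matrix to be
# materialized, costing O(n^2 d).
out_softmax = softmax(Q @ K.T / np.sqrt(d)) @ V

# Without the softmax, the product can be re-associated as Q (K^T V),
# which never forms the n x n matrix and costs O(n d^2) instead.
out_linear = Q @ (K.T @ V)
assert np.allclose((Q @ K.T) @ V, out_linear)
print(out_softmax.shape, out_linear.shape)   # both (256, 32)
```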
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling [22.278610066038954]
Attention head pruning is a promising technique for making Transformer language models more efficient.
We propose three training methods that are especially helpful to minimize performance degradation.
Our pruned model consistently achieves lower perplexity than Transformer-XL at a comparable parameter size on the WikiText-103 language modeling benchmark.
arXiv Detail & Related papers (2021-10-07T08:19:26Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Using multiple parallel attention heads improves the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function that prevents the model from assigning the same score to all heads, helping it identify the more important heads and improve performance.
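The exact form of DHICM's extra loss is not given in the summary above; the toy penalty below merely illustrates the stated goal, scoring how close the per-head importance distribution is to uniform so that training can push the model away from giving every head the same score. The function name and normalization are our own.

```python
import numpy as np

def uniformity_penalty(head_scores):
    """Toy auxiliary penalty: large when all heads get (near-)equal importance.

    `head_scores` are non-negative per-head importance weights. We normalize
    them and measure their entropy, which is maximal for a uniform
    distribution. The actual DHICM loss may differ; this only illustrates the
    idea of discouraging identical scores across heads.
    """
    p = np.asarray(head_scores, dtype=float)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy / np.log(len(p))       # in [0, 1]; 1.0 == perfectly uniform

print(uniformity_penalty([0.25, 0.25, 0.25, 0.25]))   # ~1.0 (penalized most)
print(uniformity_penalty([0.85, 0.05, 0.05, 0.05]))   # ~0.42 (penalized less)
```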
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
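As a rough sketch of the shared-projection idea (our own parameterization with hypothetical dimensions, not necessarily the paper's exact formulation): all heads reuse one pair of key/query projections and differ only in a small per-head mixing vector over the shared dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_model, shared_dim, num_heads = 16, 64, 32, 8

X = rng.standard_normal((n, d_model))

# All heads share one pair of query/key projections ...
Wq = rng.standard_normal((d_model, shared_dim))
Wk = rng.standard_normal((d_model, shared_dim))
# ... and each head only learns a small mixing vector over the shared
# key/query dimensions (illustrative parameterization).
mix = rng.standard_normal((num_heads, shared_dim))

Q, K = X @ Wq, X @ Wk                            # computed once, reused by all heads
scores = np.einsum('nd,hd,md->hnm', Q, mix, K)   # per-head n x n attention logits

# Query/key parameters: 2*d_model*shared_dim + num_heads*shared_dim = 4352,
# versus 2*d_model*d_model = 8192 for standard per-head concatenated projections.
print(scores.shape)   # (8, 16, 16)
```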
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
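To illustrate what a fixed, non-learnable attentive pattern can look like, the sketch below builds hard positional patterns (previous/current/next token). These are examples of the kind of pattern such heads could use, not necessarily the exact set from the paper; the function name and toy values are our own.

```python
import numpy as np

def fixed_pattern(n, offset):
    """Fixed, non-learnable attention matrix: token i attends to token i+offset."""
    A = np.zeros((n, n))
    for i in range(n):
        j = min(max(i + offset, 0), n - 1)   # clamp at sequence boundaries
        A[i, j] = 1.0
    return A

n = 6
prev_head = fixed_pattern(n, -1)   # attends to the previous token
curr_head = fixed_pattern(n, 0)    # attends to the token itself
next_head = fixed_pattern(n, +1)   # attends to the next token

V = np.arange(n * 4, dtype=float).reshape(n, 4)   # toy per-token value vectors
print(prev_head @ V)   # each row is simply the previous token's value vector
```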
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.