Multi-Head Attention: Collaborate Instead of Concatenate
- URL: http://arxiv.org/abs/2006.16362v2
- Date: Thu, 20 May 2021 14:48:30 GMT
- Title: Multi-Head Attention: Collaborate Instead of Concatenate
- Authors: Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
- Abstract summary: We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
- Score: 85.71058762269374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention layers are widely used in natural language processing (NLP) and are
beginning to influence computer vision architectures. Training very large
transformer models has allowed significant improvements in both fields, but once
trained, these networks show symptoms of over-parameterization. For instance,
it is known that many attention heads can be pruned without impacting accuracy.
This work aims to enhance the current understanding of how multiple heads interact.
Motivated by the observation that attention heads learn redundant key/query
projections, we propose a collaborative multi-head attention layer that enables
heads to learn shared projections. Our scheme decreases the number of
parameters in an attention layer and can be used as a drop-in replacement in
any transformer architecture. Our experiments confirm that sharing key/query
dimensions can be exploited in language understanding, machine translation and
vision. We also show that it is possible to re-parametrize a pre-trained
multi-head attention layer into our collaborative attention layer.
Collaborative multi-head attention reduces the size of the key and query
projections by a factor of 4 at the same accuracy and speed. Our code is public.
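To make the idea concrete, here is a minimal PyTorch-style sketch based only on the description above: all heads share one reduced-width key and one reduced-width query projection, and each head re-weights the shared dimensions with a learned mixing vector. The class name `CollaborativeSelfAttention`, the `shared_dim` argument, and the mixing-vector parameterization are illustrative assumptions, not the authors' released code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeSelfAttention(nn.Module):
    """Sketch of a collaborative multi-head attention layer (hypothetical naming).

    All heads share a single query and a single key projection of width
    `shared_dim`; each head differs only by a learned mixing vector over
    those shared dimensions. Values and the output projection follow
    standard multi-head attention.
    """

    def __init__(self, d_model: int, num_heads: int, shared_dim: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.shared_dim = shared_dim
        # Shared key/query projections instead of per-head W_Q^i, W_K^i.
        self.w_q = nn.Linear(d_model, shared_dim, bias=False)
        self.w_k = nn.Linear(d_model, shared_dim, bias=False)
        # One mixing vector per head over the shared dimensions.
        self.mixing = nn.Parameter(torch.randn(num_heads, shared_dim))
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, n, _ = x.shape
        q = self.w_q(x)                          # (b, n, shared_dim)
        k = self.w_k(x)                          # (b, n, shared_dim)
        v = self.w_v(x).view(b, n, self.num_heads, self.d_head)

        # Head h scores tokens with q diag(mixing[h]) k^T over the shared dims.
        scores = torch.einsum("bqd,hd,bkd->bhqk", q, self.mixing, k)
        attn = F.softmax(scores / self.shared_dim ** 0.5, dim=-1)

        out = torch.einsum("bhqk,bkhd->bqhd", attn, v).reshape(b, n, -1)
        return self.w_o(out)

# Example: with d_model=512 and num_heads=8, shared_dim=128 mirrors the
# factor-of-4 key/query reduction quoted in the abstract.
layer = CollaborativeSelfAttention(d_model=512, num_heads=8, shared_dim=128)
y = layer(torch.randn(2, 16, 512))               # (2, 16, 512)
```
Under this parameterization the layer stores two 512x128 key/query matrices plus eight 128-dimensional mixing vectors, instead of eight 512x64 query and eight 512x64 key matrices, roughly the factor-of-4 reduction mentioned in the abstract.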
Related papers
- A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attention mechanisms: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
- What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks [15.874604623294427]
We show that a transformer with only one attention layer can excel at memorization but falls short on other tasks.
We identify a class of simple operations that a single attention layer can execute, and show that complex tasks can be approached as combinations of these simple operations.
arXiv Detail & Related papers (2024-04-02T02:45:12Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
The Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer significantly outperforms state-of-the-art models.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- Evolving Attention with Residual Convolutions [29.305149185821882]
We propose a novel mechanism based on evolving attention to improve the performance of transformers.
The proposed attention mechanism achieves significant performance improvements over various state-of-the-art models on multiple tasks.
arXiv Detail & Related papers (2021-02-20T15:24:06Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be easily parallelized.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in current architectures gives rise to a low-rank bottleneck in attention heads (a numeric sketch after this list illustrates the rank cap).
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
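As a rough numeric illustration of the low-rank bottleneck described in the last entry above (not code from that paper; the sizes are arbitrary): when a model of width d is split across h heads, each head computes its n x n attention logits from d/h-dimensional queries and keys, so the logit matrix has rank at most d/h whenever d/h < n.
```python
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 128, 512, 8          # sequence length, model width, number of heads
d_head = d // h                # per-head query/key width in standard MHA: 64

X = rng.standard_normal((n, d))          # token representations
W_q = rng.standard_normal((d, d_head))   # one head's query projection
W_k = rng.standard_normal((d, d_head))   # one head's key projection

scores = (X @ W_q) @ (X @ W_k).T         # n x n attention logits for this head
print(np.linalg.matrix_rank(scores))     # prints 64: capped by d_head, not by n
```
Increasing the per-head query/key width to at least n, as that paper proposes, removes this cap.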
This list is automatically generated from the titles and abstracts of the papers on this site.