Multi-head or Single-head? An Empirical Comparison for Transformer Training
- URL: http://arxiv.org/abs/2106.09650v1
- Date: Thu, 17 Jun 2021 16:53:22 GMT
- Title: Multi-head or Single-head? An Empirical Comparison for Transformer Training
- Authors: Liyuan Liu and Jialu Liu and Jiawei Han
- Abstract summary: Multi-head attention plays a crucial role in the recent success of Transformer models.
We show that jointly attending to multiple positions is not a unique feature of multi-head attention.
We show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer.
- Score: 62.272657851060465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head attention plays a crucial role in the recent success of
Transformer models, leading to consistent performance improvements over
conventional attention in various applications. The popular belief is that this
effectiveness stems from the ability to jointly attend to multiple positions.
In this paper, we first demonstrate that jointly attending to multiple positions
is not a unique feature of multi-head attention, as multi-layer single-head
attention also attends to multiple positions and is more effective. We then
suggest that the main advantage of multi-head attention is training stability,
since it requires fewer layers than single-head attention to attend to the same
number of positions. For example, a 24-layer 16-head Transformer (BERT-large)
and a 384-layer single-head Transformer have the same total number of attention
heads and roughly the same model size, while the multi-head one is significantly
shallower. Meanwhile, we show that, with recent advances in deep learning, we
can successfully stabilize the training of the 384-layer Transformer. As training
difficulty is no longer a bottleneck, the substantially deeper single-head
Transformer achieves consistent performance improvements without tuning
hyper-parameters.
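As a concrete illustration of the comparison in the abstract, the following is a minimal Python sketch (not the authors' code) of the head-count arithmetic: the 24-layer 16-head stack and the 384-layer single-head stack both contain 384 attention heads, and if each single-head layer keeps the same 64-dimensional head width as BERT-large's heads, the attention projection weight counts match as well. The hidden size and head width are illustrative assumptions, not necessarily the paper's exact 384-layer configuration.

```python
# Head-count arithmetic from the abstract (illustrative sketch, not the
# authors' code): 24 layers x 16 heads and 384 layers x 1 head give the
# same total number of attention heads.

D_MODEL = 1024            # BERT-large hidden size
HEAD_DIM = D_MODEL // 16  # 64-dimensional heads in the 16-head layers

def attention_weights(num_layers: int, num_heads: int, head_dim: int, d_model: int) -> int:
    """Weight count of the Q/K/V/output projections for a stack of attention layers."""
    return num_layers * 4 * d_model * (num_heads * head_dim)

multi_head = attention_weights(num_layers=24, num_heads=16, head_dim=HEAD_DIM, d_model=D_MODEL)
single_head = attention_weights(num_layers=384, num_heads=1, head_dim=HEAD_DIM, d_model=D_MODEL)

print(24 * 16, 384 * 1)         # 384 heads in both configurations
print(multi_head, single_head)  # identical attention weight counts under this assumption
```

The point made in the abstract is that the multi-head stack reaches the same head budget with 16x fewer layers, which is why its training is easier to stabilize.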
Related papers
- Superiority of Multi-Head Attention in In-Context Linear Regression [39.469021333473435]
We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention.
In general, multi-head attention is preferred over single-head attention.
arXiv Detail & Related papers (2024-01-30T20:29:06Z)
- Wide Attention Is The Way Forward For Transformers [9.252523881586054]
We show that wide single-layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
arXiv Detail & Related papers (2022-10-02T21:49:54Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer output is the average of multiple branches, each of which is an independent multi-head attention layer (a rough sketch of this averaging scheme appears after this list).
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably greater expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
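The multi-branch attentive Transformer entry above describes the attention layer as the average of multiple branches, each an independent multi-head attention layer; the sketch below illustrates that averaging scheme. It is a rough, assumption-level sketch rather than the authors' implementation: the class name, branch count, and dimensions are chosen for illustration, and any branch-level regularization used in the original work is not modeled.

```python
# Rough sketch of a multi-branch attention layer: the output is the plain
# average of several independent multi-head attention branches.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_branches: int):
        super().__init__()
        # Each branch is an independent multi-head self-attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch attends over the same input; the layer output is the
        # average of the branch outputs.
        outs = [branch(x, x, x, need_weights=False)[0] for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)

# Example: 3 branches, each an 8-head attention layer over 512-dimensional tokens.
layer = MultiBranchAttention(d_model=512, num_heads=8, num_branches=3)
out = layer(torch.randn(2, 16, 512))  # (batch=2, seq_len=16, d_model=512)
```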