Multi-head or Single-head? An Empirical Comparison for Transformer Training
- URL: http://arxiv.org/abs/2106.09650v1
- Date: Thu, 17 Jun 2021 16:53:22 GMT
- Title: Multi-head or Single-head? An Empirical Comparison for Transformer Training
- Authors: Liyuan Liu and Jialu Liu and Jiawei Han
- Abstract summary: Multi-head attention plays a crucial role in the recent success of Transformer models.
We show that jointly attending to multiple positions is not a unique feature of multi-head attention.
We show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer.
- Score: 62.272657851060465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-head attention plays a crucial role in the recent success of
Transformer models, leading to consistent performance improvements over
conventional attention in various applications. The popular belief is that this
effectiveness stems from the ability to jointly attend to multiple positions.
In this paper, we first demonstrate that jointly attending to multiple positions
is not a unique feature of multi-head attention, as multi-layer single-head
attention also attends to multiple positions and is more effective. We then
suggest that the main advantage of multi-head attention is training stability,
since it requires fewer layers than single-head attention to attend to the same
number of positions. For example, a 24-layer 16-head Transformer (BERT-large)
and a 384-layer single-head Transformer have the same total number of attention
heads and roughly the same model size, while the multi-head one is significantly
shallower. Meanwhile, we show that, with recent advances in deep learning, we
can successfully stabilize the training of the 384-layer Transformer. As training
difficulty is no longer a bottleneck, the substantially deeper single-head
Transformer achieves consistent performance improvements without tuning
hyper-parameters.
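As a concrete illustration of the comparison in the abstract, the following is a minimal Python sketch (not the authors' code) of the head-count arithmetic: the 24-layer 16-head stack and the 384-layer single-head stack both contain 384 attention heads, and if each single-head layer keeps the same 64-dimensional head width as BERT-large's heads, the attention projection weight counts match as well. The hidden size and head width are illustrative assumptions, not necessarily the paper's exact 384-layer configuration.

```python
# Head-count arithmetic from the abstract (illustrative sketch, not the
# authors' code): 24 layers x 16 heads and 384 layers x 1 head give the
# same total number of attention heads.

D_MODEL = 1024            # BERT-large hidden size
HEAD_DIM = D_MODEL // 16  # 64-dimensional heads in the 16-head layers

def attention_weights(num_layers: int, num_heads: int, head_dim: int, d_model: int) -> int:
    """Weight count of the Q/K/V/output projections for a stack of attention layers."""
    return num_layers * 4 * d_model * (num_heads * head_dim)

multi_head = attention_weights(num_layers=24, num_heads=16, head_dim=HEAD_DIM, d_model=D_MODEL)
single_head = attention_weights(num_layers=384, num_heads=1, head_dim=HEAD_DIM, d_model=D_MODEL)

print(24 * 16, 384 * 1)         # 384 heads in both configurations
print(multi_head, single_head)  # identical attention weight counts under this assumption
```

The point made in the abstract is that the multi-head stack reaches the same head budget with 16x fewer layers, which is why its training is easier to stabilize.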
Related papers
- Superiority of Multi-Head Attention in In-Context Linear Regression [39.469021333473435]
We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention.
In general, multi-head attention is preferred over single-head attention.
arXiv Detail & Related papers (2024-01-30T20:29:06Z)
- Wide Attention Is The Way Forward For Transformers [9.252523881586054]
We show that wide single-layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks.
Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
arXiv Detail & Related papers (2022-10-02T21:49:54Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer output is the average of multiple branches, each of which is an independent multi-head attention layer (a rough sketch of this averaging scheme appears after this list).
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably greater expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
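The multi-branch attentive Transformer entry above describes the attention layer as the average of multiple branches, each an independent multi-head attention layer; the sketch below illustrates that averaging scheme. It is a rough, assumption-level sketch rather than the authors' implementation: the class name, branch count, and dimensions are chosen for illustration, and any branch-level regularization used in the original work is not modeled.

```python
# Rough sketch of a multi-branch attention layer: the output is the plain
# average of several independent multi-head attention branches.
# Illustrative only; not the authors' implementation.
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_branches: int):
        super().__init__()
        # Each branch is an independent multi-head self-attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch attends over the same input; the layer output is the
        # average of the branch outputs.
        outs = [branch(x, x, x, need_weights=False)[0] for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)

# Example: 3 branches, each an 8-head attention layer over 512-dimensional tokens.
layer = MultiBranchAttention(d_model=512, num_heads=8, num_branches=3)
out = layer(torch.randn(2, 16, 512))  # (batch=2, seq_len=16, d_model=512)
```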