Pit One Against Many: Leveraging Attention-head Embeddings for
Parameter-efficient Multi-head Attention
- URL: http://arxiv.org/abs/2310.07911v1
- Date: Wed, 11 Oct 2023 21:38:40 GMT
- Title: Pit One Against Many: Leveraging Attention-head Embeddings for
Parameter-efficient Multi-head Attention
- Authors: Huiyin Xue and Nikolaos Aletras
- Abstract summary: We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE), one per head.
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
- Score: 42.92397219764559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling pre-trained language models has resulted in large performance gains
in various natural language processing tasks but comes with a large cost in
memory requirements. Inspired by the position embeddings in transformers, we
aim to simplify and reduce the memory footprint of the multi-head attention
(MHA) mechanism. We propose an alternative module that uses only a single
shared projection matrix and multiple head embeddings (MHE), i.e. one per head.
We empirically demonstrate that our MHE attention is substantially more memory
efficient compared to alternative attention mechanisms while achieving a high
predictive performance retention ratio relative to vanilla MHA on several downstream
tasks. MHE attention only requires a negligible fraction of additional
parameters ($3nd$, where $n$ is the number of attention heads and $d$ the size
of the head embeddings) compared to a single-head attention, while MHA requires
$(3n^2-3n)d^2-3nd$ additional parameters.
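To make the parameter accounting above concrete, here is a minimal PyTorch sketch of how a single shared projection plus per-head embeddings could be wired up. It is only an illustration of the idea in the abstract, not the authors' implementation: the additive combination of head embeddings with the shared Q/K/V projections and the class name MHEAttention are assumptions made for this sketch.
```python
# A minimal sketch of the MHE idea, NOT the authors' reference implementation.
# Assumptions beyond the abstract: the shared projection maps the model
# dimension down to a single head size d (as in single-head attention), and
# each head h simply adds its learned embedding e_h to the projected
# queries/keys/values before scaled dot-product attention.
import math
import torch
import torch.nn as nn


class MHEAttention(nn.Module):
    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        model_dim = n_heads * head_dim
        # Shared Q/K/V projections: same cost as a single-head attention.
        self.q_proj = nn.Linear(model_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(model_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(model_dim, head_dim, bias=False)
        # Head embeddings: exactly 3 * n * d extra parameters (the $3nd$ above).
        self.q_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.k_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.v_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.out_proj = nn.Linear(n_heads * head_dim, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_heads * head_dim)
        q = self.q_proj(x).unsqueeze(1) + self.q_emb[None, :, None, :]  # (B, n, L, d)
        k = self.k_proj(x).unsqueeze(1) + self.k_emb[None, :, None, :]
        v = self.v_proj(x).unsqueeze(1) + self.v_emb[None, :, None, :]
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        heads = attn @ v  # (B, n, L, d)
        B, n, L, d = heads.shape
        return self.out_proj(heads.transpose(1, 2).reshape(B, L, n * d))


if __name__ == "__main__":
    attn = MHEAttention(n_heads=12, head_dim=64)
    extra = attn.q_emb.numel() + attn.k_emb.numel() + attn.v_emb.numel()
    assert extra == 3 * 12 * 64  # the 3nd overhead quoted in the abstract
    print(attn(torch.randn(2, 16, 12 * 64)).shape)  # torch.Size([2, 16, 768])
```
For example, with $n=12$ heads of size $d=64$, the head embeddings add only $3nd=2{,}304$ parameters on top of the shared projections, whereas the $(3n^2-3n)d^2-3nd$ formula quoted above works out to roughly 1.6M extra parameters for full MHA.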
Related papers
- MoH: Multi-Head Attention as Mixture-of-Head Attention [63.67734699877724]
We upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level.
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts mechanism.
A key advantage of MoH is that it enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
arXiv Detail & Related papers (2024-10-15T17:59:44Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates attention redundancy and skips the less important MHAs to speed up inference.
Experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- Tuning Pre-trained Model via Moment Probing [62.445281364055795]
We propose a novel Moment Probing (MP) method to explore the potential of linear probing (LP).
MP applies a linear classification head based on the mean of the final features.
Our MP significantly outperforms LP and is competitive with counterparts at less training cost.
arXiv Detail & Related papers (2023-07-21T04:15:02Z)
- Finding the Pillars of Strength for Multi-Head Attention [35.556186723898485]
Recent studies have revealed some issues with Multi-Head Attention (MHA).
We propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads.
We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a transformer with lighter weights.
arXiv Detail & Related papers (2023-05-22T03:44:44Z)
- Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
arXiv Detail & Related papers (2022-10-11T04:54:05Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Using multiple parallel attention heads improves the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function to prevent the model from assigning the same score to all heads, helping it identify the more important heads and improve performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.