Pit One Against Many: Leveraging Attention-head Embeddings for
Parameter-efficient Multi-head Attention
- URL: http://arxiv.org/abs/2310.07911v1
- Date: Wed, 11 Oct 2023 21:38:40 GMT
- Title: Pit One Against Many: Leveraging Attention-head Embeddings for
Parameter-efficient Multi-head Attention
- Authors: Huiyin Xue and Nikolaos Aletras
- Abstract summary: We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE), one per head.
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
- Score: 42.92397219764559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling pre-trained language models has resulted in large performance gains
in various natural language processing tasks but comes with a large cost in
memory requirements. Inspired by the position embeddings in transformers, we
aim to simplify and reduce the memory footprint of the multi-head attention
(MHA) mechanism. We propose an alternative module that uses only a single
shared projection matrix and multiple head embeddings (MHE), i.e. one per head.
We empirically demonstrate that our MHE attention is substantially more memory
efficient compared to alternative attention mechanisms while achieving a high
predictive performance retention ratio relative to vanilla MHA on several downstream
tasks. MHE attention only requires a negligible fraction of additional
parameters ($3nd$, where $n$ is the number of attention heads and $d$ the size
of the head embeddings) compared to a single-head attention, while MHA requires
$(3n^2-3n)d^2-3nd$ additional parameters.
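To make the parameter accounting above concrete, here is a minimal PyTorch sketch of how a single shared projection plus per-head embeddings could be wired up. It is only an illustration of the idea in the abstract, not the authors' implementation: the additive combination of head embeddings with the shared Q/K/V projections and the class name MHEAttention are assumptions made for this sketch.
```python
# A minimal sketch of the MHE idea, NOT the authors' reference implementation.
# Assumptions beyond the abstract: the shared projection maps the model
# dimension down to a single head size d (as in single-head attention), and
# each head h simply adds its learned embedding e_h to the projected
# queries/keys/values before scaled dot-product attention.
import math
import torch
import torch.nn as nn


class MHEAttention(nn.Module):
    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        model_dim = n_heads * head_dim
        # Shared Q/K/V projections: same cost as a single-head attention.
        self.q_proj = nn.Linear(model_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(model_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(model_dim, head_dim, bias=False)
        # Head embeddings: exactly 3 * n * d extra parameters (the $3nd$ above).
        self.q_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.k_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.v_emb = nn.Parameter(torch.randn(n_heads, head_dim) * 0.02)
        self.out_proj = nn.Linear(n_heads * head_dim, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_heads * head_dim)
        q = self.q_proj(x).unsqueeze(1) + self.q_emb[None, :, None, :]  # (B, n, L, d)
        k = self.k_proj(x).unsqueeze(1) + self.k_emb[None, :, None, :]
        v = self.v_proj(x).unsqueeze(1) + self.v_emb[None, :, None, :]
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        heads = attn @ v  # (B, n, L, d)
        B, n, L, d = heads.shape
        return self.out_proj(heads.transpose(1, 2).reshape(B, L, n * d))


if __name__ == "__main__":
    attn = MHEAttention(n_heads=12, head_dim=64)
    extra = attn.q_emb.numel() + attn.k_emb.numel() + attn.v_emb.numel()
    assert extra == 3 * 12 * 64  # the 3nd overhead quoted in the abstract
    print(attn(torch.randn(2, 16, 12 * 64)).shape)  # torch.Size([2, 16, 768])
```
For example, with $n=12$ heads of size $d=64$, the head embeddings add only $3nd=2{,}304$ parameters on top of the shared projections, whereas the $(3n^2-3n)d^2-3nd$ formula quoted above works out to roughly 1.6M extra parameters for full MHA.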
Related papers
- MoH: Multi-Head Attention as Mixture-of-Head Attention [63.67734699877724]
We upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level.
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts mechanism.
A key advantage of MoH is that it enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
arXiv Detail & Related papers (2024-10-15T17:59:44Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates attention redundancy and skips the less important MHAs to speed up inference.
Experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- Tuning Pre-trained Model via Moment Probing [62.445281364055795]
We propose a novel Moment Probing (MP) method to explore the potential of linear probing (LP).
MP applies a linear classification head based on the mean of the final features.
Our MP significantly outperforms LP and is competitive with counterparts at less training cost.
arXiv Detail & Related papers (2023-07-21T04:15:02Z)
- Finding the Pillars of Strength for Multi-Head Attention [35.556186723898485]
Recent studies have revealed some issues with Multi-Head Attention (MHA).
We propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads.
We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a transformer with lighter weights.
arXiv Detail & Related papers (2023-05-22T03:44:44Z)
- Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
arXiv Detail & Related papers (2022-10-11T04:54:05Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Using multiple parallel attention heads improves the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function to prevent the model from assigning the same score to all heads, helping it identify the more important heads and improve performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.