MoH: Multi-Head Attention as Mixture-of-Head Attention
- URL: http://arxiv.org/abs/2410.11842v1
- Date: Tue, 15 Oct 2024 17:59:44 GMT
- Title: MoH: Multi-Head Attention as Mixture-of-Head Attention
- Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
- Abstract summary: We upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level.
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts mechanism.
MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, adding flexibility to the attention mechanism and unlocking extra performance potential.
- Score: 63.67734699877724
- Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
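Below is a minimal sketch of the core idea described in the abstract: a per-token router selects the top-k attention heads, and the selected head outputs are combined with a weighted rather than uniform summation. The class name MoHAttention, the single-linear-layer router, and the heads_per_token setting are illustrative assumptions for this sketch, not the paper's released implementation; details such as shared heads or auxiliary routing losses are omitted.

    # Hedged sketch of Mixture-of-Head (MoH) attention: each token routes to a
    # top-k subset of heads, and the selected head outputs are combined with a
    # weighted (rather than uniform) summation before the output projection.
    import torch
    import torch.nn as nn


    class MoHAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8, heads_per_token: int = 6):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.k = heads_per_token                    # active heads per token
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)
            self.router = nn.Linear(dim, num_heads)     # per-token head scores

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (B, H, N, Dh)

            attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            heads = attn.softmax(dim=-1) @ v            # (B, H, N, Dh)

            # Router: keep the top-k heads per token, renormalize their scores,
            # and zero out the rest, turning the usual uniform combination of
            # heads into a sparse, weighted summation.
            scores = self.router(x)                     # (B, N, H)
            topk_val, topk_idx = scores.topk(self.k, dim=-1)
            gates = torch.zeros_like(scores).scatter(
                -1, topk_idx, topk_val.softmax(dim=-1))  # (B, N, H)

            heads = heads.permute(0, 2, 1, 3)           # (B, N, H, Dh)
            heads = heads * gates.unsqueeze(-1)         # weight each head per token
            return self.proj(heads.reshape(B, N, C))    # project the weighted heads

    # Toy usage: 2 sequences of 16 tokens, width 64, 6 of 8 heads per token.
    moh = MoHAttention(dim=64, num_heads=8, heads_per_token=6)
    print(moh(torch.randn(2, 16, 64)).shape)            # torch.Size([2, 16, 64])

Because only the gating differs from standard multi-head attention, a pre-trained MHA model can in principle be continue-tuned into this form, which is consistent with the LLaMA3-8B continue-tuning result reported in the abstract.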
Related papers
- MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose MC-MoE, a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs).
MC-MoE leverages the significance of both experts and tokens to achieve extreme compression.
For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates attention redundancy and skips the less important MHAs to speed up inference.
Experiments show that EAS not only retains high performance and parameter efficiency but also greatly speeds up inference (a minimal sketch of this skipping idea appears after this list).
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention [42.92397219764559]
We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE).
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
arXiv Detail & Related papers (2023-10-11T21:38:40Z)
- Finding the Pillars of Strength for Multi-Head Attention [35.556186723898485]
Recent studies have revealed several issues with Multi-Head Attention (MHA).
We propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads.
We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a transformer with lighter weights.
arXiv Detail & Related papers (2023-05-22T03:44:44Z)
- Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates the heads' utilities, providing a new perspective for discussing the model's interpretability.
arXiv Detail & Related papers (2022-10-11T04:54:05Z)
- M2A: Motion Aware Attention for Accurate Video Action Recognition [86.67413715815744]
We develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics.
M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos.
We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a +15% to +26% improvement in top-1 accuracy across different backbone architectures.
arXiv Detail & Related papers (2021-11-18T23:38:09Z)
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
arXiv Detail & Related papers (2020-09-20T06:32:23Z)
- Multi-head Monotonic Chunkwise Attention For Online Speech Recognition [12.619595173472465]
We propose multi-head monotonic chunk-wise attention (MTH-MoChA), an improved version of MoChA.
MTH-MoChA splits the input sequence into small chunks and computes multi-head attentions over the chunks.
Experiments on AISHELL-1 data show that the proposed model, along with the training strategies, improves the character error rate (CER) of MoChA from 8.96% to 7.68% on the test set.
arXiv Detail & Related papers (2020-05-01T04:00:51Z)
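As referenced in the EAS entry above, the following is a minimal sketch of the attention-skipping idea: given a precomputed per-layer importance score (how EAS actually measures attention redundancy is not shown here), blocks whose score falls below a threshold bypass their attention sublayer and keep only the residual and MLP path. The SkippableBlock class and its arguments are illustrative assumptions, not the EAS implementation.

    # Hedged sketch of attention skipping: layers judged unimportant run
    # without their multi-head attention sublayer to cut inference cost.
    import torch
    import torch.nn as nn


    class SkippableBlock(nn.Module):
        def __init__(self, dim: int, num_heads: int, importance: float, threshold: float):
            super().__init__()
            self.skip_attn = importance < threshold     # decided once, before inference
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if not self.skip_attn:                      # keep only the important MHAs
                h = self.norm1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.mlp(self.norm2(x))

    # Toy usage: a layer judged unimportant (0.1 < 0.5) runs without its MHA.
    block = SkippableBlock(dim=64, num_heads=8, importance=0.1, threshold=0.5)
    print(block(torch.randn(2, 16, 64)).shape)          # torch.Size([2, 16, 64])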
This list is automatically generated from the titles and abstracts of the papers on this site.