You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism
- URL: http://arxiv.org/abs/2403.01643v2
- Date: Thu, 30 May 2024 17:46:22 GMT
- Title: You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism
- Authors: Mehran Hosseini, Peyman Hosseini
- Abstract summary: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models.
This paper introduces three enhanced attention mechanisms: Optimised, Efficient, and Super Attention.
Super Attention introduces a new linear transformation on the values, transforming them from the left.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models. It is so versatile that it has been used in natural language, vision, and multi-modal domains with very little change compared to its original formulation. This paper discusses why the current formulation is inefficient by delving into the mathematical details of the attention mechanism. We propose three improvements to mitigate these inefficiencies, thereby introducing three enhanced attention mechanisms: Optimised, Efficient, and Super Attention. Optimised and Efficient Attention have one and two fewer matrix multiplications per head and 25% and 50% fewer parameters, respectively, than standard SDPA, yet perform similarly to standard SDPA in both vision and natural language tasks. They can be used in all applications where SDPA is used, offering smaller model sizes and faster training and inference without noticeable loss in performance. Super Attention introduces a new linear transformation on the values, transforming them from the left. It outperforms standard SDPA on vision and natural language tasks by up to 17% while having one fewer matrix multiplication per head and 25% fewer parameters than standard SDPA; consequently, it is also faster than standard SDPA. Super Attention is ideal in applications where the attention layer's context length is fixed, such as Vision Transformers. In addition to providing mathematical reasoning, we evaluate the presented attention mechanisms on several datasets, including MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews, as well as combined Europarl and Anki English-Spanish datasets for neural machine translation.
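The abstract gives only a high-level description of the three variants, so the NumPy sketch below should be read as one plausible interpretation, not the paper's exact formulation: standard SDPA alongside a hypothetical Optimised Attention (one projection dropped, saving one matrix multiplication and 25% of the parameters) and a hypothetical Super Attention (values additionally multiplied from the left by a learned n-by-n matrix `Wa`, which is what ties the layer to a fixed context length). The names `Wq`, `Wk`, `Wv`, and `Wa`, and the choice of which projection is removed, are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(X, Wq, Wk, Wv):
    """Standard scaled dot-product attention: three projections per head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def optimised_attention(X, Wq, Wk):
    """Hypothetical reading of Optimised Attention: the value projection is
    dropped (one matmul and 25% of the parameters saved); the raw inputs
    serve as the values."""
    Q, K = X @ Wq, X @ Wk
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ X

def super_attention(X, Wq, Wk, Wa):
    """Hypothetical reading of Super Attention: values are transformed from
    the left by a learned (n x n) matrix Wa, so the layer only works for a
    fixed context length n."""
    Q, K = X @ Wq, X @ Wk
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ (Wa @ X)
```

Under this reading, Efficient Attention would drop a second projection in the same spirit, reaching the abstract's count of two fewer matrix multiplications and 50% fewer parameters; the left multiplication `Wa @ X` acts across token positions rather than feature dimensions, which is why it only applies when the context length is fixed, as in Vision Transformers.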
Related papers
- PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer [33.71410239689095]
PADRe is a framework designed to replace the conventional self-attention mechanism in transformer models.
PADRe's key components include multiplicative nonlinearities, which we implement using straightforward, hardware-friendly operations.
We assess the effectiveness of PADRe as a drop-in replacement for self-attention across diverse computer vision tasks.
arXiv Detail & Related papers (2024-07-16T01:45:44Z)
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly accelerates inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
- Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention [42.92397219764559]
We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE).
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
arXiv Detail & Related papers (2023-10-11T21:38:40Z)
- SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long sequences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
arXiv Detail & Related papers (2023-10-03T03:56:26Z)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize.
We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual.
We propose two simple (independent) modifications to the attention mechanism: clipped softmax and gated attention.
arXiv Detail & Related papers (2023-06-22T14:39:04Z)
- SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, making it more accurate and 2x faster than MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Data Augmentation for Copy-Mechanism in Dialogue State Tracking [30.768655511224527]
We identify the factors that influence the generalization ability of a common copy-mechanism model for dialogue state tracking (DST).
We propose a simple but effective algorithm of data augmentation to train copy-mechanism models, which augments the input dataset by copying user utterances and replacing the real slot values with randomly generated strings.
arXiv Detail & Related papers (2020-02-22T05:40:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.