You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism
- URL: http://arxiv.org/abs/2403.01643v2
- Date: Thu, 30 May 2024 17:46:22 GMT
- Title: You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism
- Authors: Mehran Hosseini, Peyman Hosseini
- Abstract summary: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models.
This paper introduces three enhanced attention mechanisms: Optimised, Efficient, and Super Attention.
Super Attention introduces a new linear transformation on the values, transforming them from the left.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models. It is so versatile that it has been used in natural language, vision, and multi-modal domains with very little change compared to its original formulation. This paper discusses why the current formulation is inefficient by delving into the mathematical details of the attention mechanism. We propose three improvements to mitigate these inefficiencies, thereby introducing three enhanced attention mechanisms: Optimised, Efficient, and Super Attention. Optimised and Efficient Attention have one and two fewer matrix multiplications per head, respectively, and 25% and 50% fewer parameters, respectively, than standard SDPA, but perform similarly to standard SDPA in both vision and natural language tasks. They can be used in all applications where SDPA is used while offering smaller model sizes and faster training and inference without noticeable loss in performance. Super Attention introduces a new linear transformation on the values, transforming them from the left. It outperforms standard SDPA on vision and natural language tasks by up to 17% while having one fewer matrix multiplication per head and 25% fewer parameters than standard SDPA. Consequently, it is also faster than standard SDPA. Super Attention is ideal in applications where the attention layer's context length is fixed, such as Vision Transformers. In addition to providing mathematical reasoning, we evaluate the presented attention mechanisms on several datasets, including MNIST, CIFAR100, ImageNet, IMDB Movie Reviews, and Amazon Reviews, as well as combined Europarl and Anki English-Spanish datasets for neural machine translation.
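The abstract's description can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: `sdpa_head` is ordinary single-head scaled dot-product self-attention, while `left_transformed_head`, the matrix name `W_a`, and the helper `attention_params` are hypothetical names introduced here. The sketch only shows what "transforming the values from the left" means in terms of shapes, and why such a layer suits a fixed context length. The parameter count assumes four square projection matrices per standard attention layer with biases ignored; the abstract does not say which matrices the variants remove.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa_head(X, W_q, W_k, W_v):
    # Standard single-head scaled dot-product self-attention.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def left_transformed_head(X, W_q, W_k, W_v, W_a):
    # Hypothetical variant: W_a has shape (seq_len, seq_len) and multiplies
    # the values from the left, so it mixes tokens rather than features.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ (W_a @ V)

def attention_params(d_model, n_matrices):
    # Weight count for n square (d_model x d_model) matrices, biases ignored.
    return n_matrices * d_model * d_model

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_a = rng.normal(size=(seq_len, seq_len))  # tied to the context length

out_std = sdpa_head(X, W_q, W_k, W_v)
out_left = left_transformed_head(X, W_q, W_k, W_v, W_a)
print(out_std.shape, out_left.shape)

# Removing one or two of four square matrices matches the quoted savings.
print(1 - attention_params(d_model, 3) / attention_params(d_model, 4))
print(1 - attention_params(d_model, 2) / attention_params(d_model, 4))
```

Because `W_a` is sized by the sequence length rather than the feature dimension, passing an input with a different number of tokens makes `W_a @ V` fail with a shape error. That is consistent with the abstract's remark that Super Attention fits settings such as Vision Transformers, where the context length is fixed.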
Related papers
- Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z) - Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC [8.837470787975308]
Large Language Models (LLMs) on edge devices offer significant privacy benefits. These on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. We introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs.
arXiv Detail & Related papers (2025-05-21T02:23:01Z) - Structure-Activation Synergy: A Dual Efficiency Framework for Parameter-Memory Optimized Transfer Learning [8.602744958104969]
We present Structure-Activation Synergy (S2A), an innovative framework achieving dual optimization of parameters and memory.
We show S2A's superior efficiency, reducing GPU memory consumption by 75% (4.2 average reduction) while maintaining 98.7% of full fine-tuning accuracy with only 0.9% tunable parameters.
arXiv Detail & Related papers (2025-03-11T08:10:03Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - OP-LoRA: The Blessing of Dimensionality [93.08208871549557]
Low-rank adapters enable fine-tuning of large models with only a small number of parameters.
They often pose optimization challenges, with poor convergence.
We introduce an over-parameterized approach that accelerates training without increasing inference costs.
We achieve improvements in vision-language tasks and especially notable increases in image generation.
arXiv Detail & Related papers (2024-12-13T18:55:19Z) - EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models [29.57891007810509]
Large Language Models (LLMs) have demonstrated outstanding performance across a variety of natural language processing tasks.
We introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers.
Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance.
arXiv Detail & Related papers (2024-09-22T21:08:37Z) - Propulsion: Steering LLM with Tiny Fine-Tuning [0.0]
We propose Propulsion, a novel parameter efficient fine-tuning (PEFT) method to optimize task-specific performance.
Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model.
Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters.
arXiv Detail & Related papers (2024-09-17T06:51:59Z) - PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer [33.71410239689095]
PADRe is a framework designed to replace the conventional self-attention mechanism in transformer models.
PADRe's key components include multiplicative nonlinearities, which we implement using straightforward, hardware-friendly operations.
We assess the effectiveness of PADRe as a drop-in replacement for self-attention across diverse computer vision tasks.
arXiv Detail & Related papers (2024-07-16T01:45:44Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - Simple linear attention language models balance the recall-throughput tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z) - Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention [42.92397219764559]
We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE).
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
arXiv Detail & Related papers (2023-10-11T21:38:40Z) - Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models [53.859446823312126]
SoTA open source models of varying sizes from 7B - 65B, on average, improve 8.2% from their baseline performance.
Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show an 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open-ended tasks.
arXiv Detail & Related papers (2023-10-11T15:56:00Z) - SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long sequences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
arXiv Detail & Related papers (2023-10-03T03:56:26Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize.
We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual.
We propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention.
arXiv Detail & Related papers (2023-06-22T14:39:04Z) - Accurate and Structured Pruning for Efficient Automatic Speech Recognition [23.897482741744117]
We propose a novel compression strategy to reduce the model size and inference cost of the Conformer model.
Our method achieves a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.
arXiv Detail & Related papers (2023-05-31T04:31:16Z) - SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications.
We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed.
Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.