FlashBias: Fast Computation of Attention with Bias
- URL: http://arxiv.org/abs/2505.12044v1
- Date: Sat, 17 May 2025 15:12:50 GMT
- Title: FlashBias: Fast Computation of Attention with Bias
- Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long,
- Abstract summary: This paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases.<n>FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$times$ speedup for AlphaFold, and over 2$times$ speedup for attention with bias in vision and language models without loss of accuracy.
- Score: 77.39043478894504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanism has emerged as a foundation module of modern deep learning models and has also empowered many milestones in various domains. Moreover, FlashAttention with IO-aware speedup resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias also widely exists, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term of attention weights to guide the learning process, which has been proven essential for model performance. Surprisingly, despite the common usage of attention with bias, its targeted efficiency optimization is still absent, which seriously hinders its wide applications in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalization. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for AlphaFold, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy.
Related papers
- Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors [5.385189465543017]
This work focuses on the acceleration of attention kernel using FlashAttention algorithm, in vector processors.<n>By utilizing a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function.
arXiv Detail & Related papers (2025-10-08T09:55:32Z) - FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model [87.38823851271758]
We propose a new framework for Transformer-like recommendation models.<n>FuXi-$beta$ outperforms previous state-of-the-art models and achieves significant acceleration.<n>Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta$.
arXiv Detail & Related papers (2025-08-14T13:12:29Z) - Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization.<n>We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing [4.7924863950812995]
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference.<n>We propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads.<n>Our method effectively captures actual patterns while requiring full attention for only a small subset of heads.
arXiv Detail & Related papers (2025-05-26T06:48:53Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has a quadratic complexity, leading to high inference costs and latency for long sequences.<n>We propose a simple, novel, and effective procedure for correcting this distributional shift.<n>Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - XAttention: Block Sparse Attention with Antidiagonal Scoring [10.517760961650279]
Long-context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity.<n>We introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformers models using sparse attention.
arXiv Detail & Related papers (2025-03-20T17:59:58Z) - Attention Condensation via Sparsity Induced Regularized Training [0.0]
Self-attention dominates the transformer's inference time as the context window expands.<n>We extend a theoretical framework of attention sparsity in Large Language Models.<n>A customized loss function is designed to enforce the sparsity by restricting the number of top elements in the attention matrix.
arXiv Detail & Related papers (2025-03-03T14:09:13Z) - Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis [15.71443217369106]
We develop a low-precision, mathematically-equivalent algorithm called PASA, based on Flash Attention.<n> PASA introduces two novel techniques: online pseudo-average shifting and global recovering.<n>We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow.
arXiv Detail & Related papers (2025-02-26T01:00:46Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs)
We propose a training-free Attention Technique (ACT) that automatically optimize the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse [9.552839922307587]
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity.<n> Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment.
arXiv Detail & Related papers (2024-04-03T12:37:34Z) - Faster Causal Attention Over Large Sequences Through Sparse Flash
Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0times$ and $3.3times$ for sequences of respectively $8k$ and $16k$ tokens.
arXiv Detail & Related papers (2023-06-01T21:33:59Z) - Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence
Embedding [51.48582649050054]
We propose a representation normalization method which aims at disentangling the correlations between features of encoded sentences.
We also propose Kernel-Whitening, a Nystrom kernel approximation method to achieve more thorough debiasing on nonlinear spurious correlations.
Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.
arXiv Detail & Related papers (2022-10-14T05:56:38Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to its powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can be also applied in guidance of SparseBERT design.
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Wave Propagation of Visual Stimuli in Focus of Attention [77.4747032928547]
Fast reactions to changes in the surrounding visual environment require efficient attention mechanisms to reallocate computational resources to most relevant locations in the visual field.
We present a biologically-plausible model of focus of attention that exhibits effectiveness and efficiency exhibited by foveated animals.
arXiv Detail & Related papers (2020-06-19T09:33:21Z) - Focus of Attention Improves Information Transfer in Visual Features [80.22965663534556]
This paper focuses on unsupervised learning for transferring visual information in a truly online setting.
The computation of the entropy terms is carried out by a temporal process which yields online estimation of the entropy terms.
In order to better structure the input probability distribution, we use a human-like focus of attention model.
arXiv Detail & Related papers (2020-06-16T15:07:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.