FlashBias: Fast Computation of Attention with Bias
- URL: http://arxiv.org/abs/2505.12044v3
- Date: Fri, 24 Oct 2025 02:49:22 GMT
- Title: FlashBias: Fast Computation of Attention with Bias
- Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long
- Abstract summary: Attention with bias has been widely deployed in vision, language, protein-folding and other advanced scientific models. Introducing the bias term disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention. This paper presents FlashBias, based on low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases.
- Score: 70.44379606190569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formulations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
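One natural reading of the abstract's low-rank result is the following algebraic trick: if the bias matrix factors (exactly or approximately) as $B = U W^\top$ with small rank $r$, the factors can be folded into $r$ extra query/key channels, so biased attention reduces to standard attention over slightly wider heads, which a fused kernel handles natively. A minimal NumPy sketch of that idea; the function names are ours, not the repository's API, and the kernel-level details live in the code release:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(Q, K, V, B):
    # Reference: softmax(Q K^T / sqrt(d) + B) V with an explicit n x n bias.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ V

def folded_biased_attention(Q, K, V, U, W):
    # If B = U @ W.T with rank r << n, append the factors as r extra
    # query/key channels; the biased score matrix then comes from a single
    # matmul that a fused kernel such as FlashAttention can consume.
    d = Q.shape[-1]
    Qa = np.concatenate([Q / np.sqrt(d), U], axis=-1)  # (n, d + r)
    Ka = np.concatenate([K, W], axis=-1)               # (n, d + r)
    return softmax(Qa @ Ka.T) @ V

rng = np.random.default_rng(0)
n, d, r = 8, 16, 2
Q, K, V = rng.normal(size=(3, n, d))
U, W = rng.normal(size=(2, n, r))
assert np.allclose(biased_attention(Q, K, V, U @ W.T),
                   folded_biased_attention(Q, K, V, U, W))
```

The payoff is that no $n \times n$ bias matrix is ever materialized or streamed from memory: the bias costs only $r$ extra channels inside an already memory-bound matmul.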
Related papers
- Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors [5.385189465543017]
This work focuses on accelerating the attention kernel using the FlashAttention algorithm on vector processors. By utilizing a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function.
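The summary does not spell out which approximation is used; one classic low-cost scheme in this spirit is Schraudolph's trick of assembling the IEEE-754 bit pattern of $e^x$ with a single multiply-add. The sketch below is purely illustrative and is not the paper's scheme:

```python
import numpy as np

def fast_exp(x):
    # Schraudolph-style approximation: write a*x + b into the int32 bit
    # pattern of a float32 so the exponent field encodes x / ln 2.
    # a = 2^23 / ln 2; b tunes the bias to reduce relative error (~2-3%).
    i = (12102203.0 * x + 1064866805.0).astype(np.int32)
    return i.view(np.float32)

x = np.linspace(-5.0, 5.0, 11).astype(np.float32)
rel_err = np.abs(fast_exp(x) - np.exp(x)) / np.exp(x)
print(rel_err.max())  # a few percent, at a fraction of exp()'s cost
```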
arXiv Detail & Related papers (2025-10-08T09:55:32Z) - FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model [87.38823851271758]
We propose a new framework for Transformer-like recommendation models. FuXi-β outperforms previous state-of-the-art models and achieves significant acceleration. Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta.
arXiv Detail & Related papers (2025-08-14T13:12:29Z) - Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing [4.7924863950812995]
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference. We propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads. Our method effectively captures the actual attention patterns while requiring full attention for only a small subset of heads.
arXiv Detail & Related papers (2025-05-26T06:48:53Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
Transformers have quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting the distributional shift that sparse attention introduces. Our method can maintain approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - XAttention: Block Sparse Attention with Antidiagonal Scoring [10.517760961650279]
Long-context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. We introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention.
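As we read the title, "antidiagonal scoring" rates each block of the attention-score map by a cheap antidiagonal sum, so that only high-scoring blocks need to be computed. A hedged NumPy reconstruction of that rating step (ours, not the paper's code):

```python
import numpy as np

def antidiagonal_block_scores(scores, block=4):
    # Rate each (block x block) tile of the score matrix by the sum along
    # its antidiagonal -- a strided, cheap proxy for the tile's total mass.
    # Tiles with the highest ratings would be kept; the rest are skipped.
    n = scores.shape[0]
    tiles = scores.reshape(n // block, block, n // block, block)
    tiles = tiles.transpose(0, 2, 1, 3)            # (nb, nb, block, block)
    anti = np.fliplr(np.eye(block, dtype=bool))    # antidiagonal mask
    return tiles[..., anti].sum(-1)                # (nb, nb) tile ratings

S = np.abs(np.random.default_rng(0).normal(size=(16, 16)))
print(antidiagonal_block_scores(S, block=4))
```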
arXiv Detail & Related papers (2025-03-20T17:59:58Z) - Attention Condensation via Sparsity Induced Regularized Training [0.0]
Self-attention dominates the transformer's inference time as the context window expands. We extend a theoretical framework of attention sparsity in Large Language Models. A customized loss function enforces sparsity by restricting the number of top elements in the attention matrix.
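The summary leaves the loss unspecified; one way to "restrict the number of top elements" is to penalize the probability mass outside each row's top-k entries. A hypothetical sketch, not the paper's loss:

```python
import numpy as np

def topk_mass_penalty(attn, k):
    # attn: (n, n) row-stochastic attention matrix.
    # Penalize the probability mass outside each row's top-k entries;
    # the penalty is zero exactly when every row is k-sparse.
    topk = np.sort(attn, axis=-1)[:, -k:]          # largest k per row
    return float((1.0 - topk.sum(axis=-1)).mean())

attn = np.full((4, 4), 0.25)                       # maximally diffuse rows
print(topk_mass_penalty(attn, k=2))                # 0.5 -> strong penalty
```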
arXiv Detail & Related papers (2025-03-03T14:09:13Z) - Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis [15.71443217369106]
We develop PASA, a low-precision algorithm based on Flash Attention that is mathematically equivalent to it. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. We find that the large bias and amplitude of the attention input data are critical factors contributing to numerical overflow.
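The exact pseudo-average update is not given in the summary, but the identity that makes any such shift safe is simple: softmax is invariant to a per-row constant, so subtracting a running average instead of the exact maximum keeps exponent magnitudes small for low-precision hardware without changing the result. A small demonstration of that identity (not the paper's algorithm):

```python
import numpy as np

def shifted_softmax(x, shift):
    # softmax(x - c) == softmax(x) for any per-row constant c, so a cheap
    # running average works as the shift in place of the exact row max.
    e = np.exp(x - shift[:, None])
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.default_rng(0).normal(size=(4, 8)) * 10
avg = x.mean(axis=-1)                              # pseudo-average shift
mx = x.max(axis=-1)                                # FlashAttention's shift
assert np.allclose(shifted_softmax(x, avg), shifted_softmax(x, mx))
```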
arXiv Detail & Related papers (2025-02-26T01:00:46Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs).
We propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse [9.552839922307587]
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. Variations of this technique, such as pruning the KV cache, sparsity-based fast attention, and the Sparse Transformer, have been extensively utilized for efficient deployment of Large Language Models (LLMs).
arXiv Detail & Related papers (2024-04-03T12:37:34Z) - Faster Causal Attention Over Large Sequences Through Sparse Flash Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
arXiv Detail & Related papers (2023-06-01T21:33:59Z) - Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding [51.48582649050054]
We propose a representation normalization method which aims at disentangling the correlations between features of encoded sentences.
We also propose Kernel-Whitening, a Nystrom kernel approximation method to achieve more thorough debiasing on nonlinear spurious correlations.
Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.
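The kernelized Nystrom variant is the paper's contribution; the linear case it generalizes is standard ZCA whitening, sketched here for intuition (our illustration, not the paper's method):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    # Center the embeddings, then rotate and rescale so that features are
    # uncorrelated with unit variance (isotropic covariance). The paper's
    # Kernel-Whitening lifts this to a kernel feature space via Nystrom.
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 8))  # correlated features
Z = zca_whiten(X)
print(np.round(Z.T @ Z / len(Z), 2))                     # ~ identity matrix
```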
arXiv Detail & Related papers (2022-10-14T05:56:38Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
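The blurb does not define DAM; one common way to make an attention mask differentiable is a sigmoid-relaxed gate per position pair, trained jointly and thresholded to a hard sparse mask afterwards. A speculative sketch along those lines (names and details are ours, not the paper's):

```python
import numpy as np

def gated_attention(scores, mask_logits):
    # Sigmoid-relax the binary mask so it is trainable end-to-end; after
    # training, threshold the gates to obtain a hard sparse attention mask.
    gate = 1.0 / (1.0 + np.exp(-mask_logits))      # soft mask in (0, 1)
    e = np.exp(scores - scores.max(-1, keepdims=True)) * gate
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
attn = gated_attention(rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
print(attn.sum(-1))                                # rows still sum to 1
```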
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Wave Propagation of Visual Stimuli in Focus of Attention [77.4747032928547]
Fast reactions to changes in the surrounding visual environment require efficient attention mechanisms to reallocate computational resources to the most relevant locations in the visual field.
We present a biologically plausible model of focus of attention that exhibits the effectiveness and efficiency of foveated animals.
arXiv Detail & Related papers (2020-06-19T09:33:21Z) - Focus of Attention Improves Information Transfer in Visual Features [80.22965663534556]
This paper focuses on unsupervised learning for transferring visual information in a truly online setting.
The entropy terms are computed by a temporal process that yields online estimates.
In order to better structure the input probability distribution, we use a human-like focus of attention model.
arXiv Detail & Related papers (2020-06-16T15:07:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.