Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
- URL: http://arxiv.org/abs/2403.04690v3
- Date: Thu, 31 Oct 2024 17:32:26 GMT
- Title: Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
- Authors: Ali Hassani, Wen-Mei Hwu, Humphrey Shi,
- Abstract summary: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors.
We show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention.
We develop fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes.
- Score: 30.681204292813998
- License:
- Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we aim to massively improve upon existing infrastructure by providing two new methods for implementing neighborhood attention. We first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision runtime compared to existing naive CUDA kernels for 1-D and 2-D neighborhood attention respectively. We find that aside from being heavily bound by memory bandwidth, certain inherent inefficiencies exist in all unfused implementations of neighborhood attention, which in most cases undo their theoretical efficiency gain. Motivated by the progress made into fused dot-product attention kernels, we developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision runtime. We observe that our fused implementation successfully circumvents some of the unavoidable inefficiencies in unfused implementations...
Related papers
- S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context.
It remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models.
This paper presents Sparsely-Sharded(S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence
Lengths in Large Language Models [20.78813311569383]
We present Lightning Attention, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
arXiv Detail & Related papers (2024-01-09T16:27:28Z) - RFAConv: Innovating Spatial Attention and Standard Convolutional Operation [7.2646541547165056]
We propose a novel attention mechanism called Receptive-Field Attention (RFA)
RFA focuses on the receptive-field spatial feature but also provides effective attention weights for large-size convolutional kernels.
It offers nearly negligible increment of computational cost and parameters, while significantly improving network performance.
arXiv Detail & Related papers (2023-04-06T16:21:56Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reach state-of-the-art accuracies on the parameter-limited setting of ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - Towards Joint Intent Detection and Slot Filling via Higher-order
Attention [47.78365472691051]
Intent detection (ID) and Slot filling (SF) are two major tasks in spoken language understanding (SLU)
We propose a Bilinear attention block, which exploits both the contextual and channel-wise bilinear attention distributions.
We show that our approach yields improvements compared with the state-of-the-art approach.
arXiv Detail & Related papers (2021-09-18T09:50:23Z) - Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
arXiv Detail & Related papers (2021-02-08T17:00:26Z) - AttentionNAS: Spatiotemporal Attention Cell Search for Video
Classification [86.64702967379709]
We propose a novel search space fortemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell.
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video accuracy by more than 2% on both Kinetics-600 and MiT datasets.
arXiv Detail & Related papers (2020-07-23T14:30:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.