Flex Attention: A Programming Model for Generating Optimized Attention Kernels
- URL: http://arxiv.org/abs/2412.05496v1
- Date: Sat, 07 Dec 2024 01:46:38 GMT
- Title: Flex Attention: A Programming Model for Generating Optimized Attention Kernels
- Authors: Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He
- Abstract summary: We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of PyTorch code. We demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.
- Score: 5.489362130813523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.
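The programming model can be sketched directly in PyTorch. The snippet below is a minimal sketch using the torch.nn.attention.flex_attention API that ships in recent PyTorch releases (2.5+); the shapes, the ALiBi-style slopes, and the CUDA device are illustrative assumptions, and exact signatures may vary by version.

```python
# Minimal sketch of the FlexAttention programming model (recent PyTorch releases;
# assumes a CUDA device, illustrative shapes and ALiBi slopes).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))

# An attention variant expressed as a score modification: an ALiBi-style bias
# with a simple per-head slope (illustrative choice).
slopes = torch.exp2(-torch.arange(1, H + 1, device="cuda", dtype=torch.float32))

def alibi(score, b, h, q_idx, kv_idx):
    return score - slopes[h] * (q_idx - kv_idx)

# Sparsity expressed as a mask: causal attention as a block mask, so the
# generated kernel can skip fully masked blocks.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

# Variants compose: the same call takes both the score modification and the
# block mask, and torch.compile generates a single fused kernel for them.
flex_attention = torch.compile(flex_attention)
out = flex_attention(q, k, v, score_mod=alibi, block_mask=block_mask)
```

Because the score modification and the mask are written independently, separately authored variants can be passed to the same call, which is the composition property the abstract highlights.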
Related papers
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible Token Pruning method with inference-Time Optimization for fast and low-memory VLMs.
Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
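As a rough illustration of this kind of importance-based pruning (the importance scores below are a generic stand-in, not TopV's visual-aware cost function):

```python
# Generic sketch of importance-based visual token pruning; `importance` stands in
# for a visual-aware cost scored once at inference time.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        importance: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """visual_tokens: (B, N, D); importance: (B, N), higher = more important."""
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = importance.topk(k, dim=1).indices          # (B, k) most important tokens
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)   # (B, k, D)
    return visual_tokens.gather(1, keep_idx)              # pruned token set
```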
arXiv Detail & Related papers (2025-03-24T01:47:26Z) - FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference [9.878232553789433]
Large language models (LLMs) encounter computational challenges during long-sequence inference.
Previous efforts have relied on fixed sparse attention patterns or identifying sparse attention patterns based on limited cases.
We introduce FlexPrefill, a Flexible sparse Pre-filling mechanism that dynamically adjusts sparse attention patterns and computational budget in real-time.
arXiv Detail & Related papers (2025-02-28T06:34:53Z) - Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models [49.84163262868945]
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling.
The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers.
We propose parallel context encoding, which splits the context into sub-pieces and encodes them in parallel.
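A minimal sketch of the idea, where `encode_fn` is a hypothetical stand-in for a full-attention encoder that returns key/value states for one sub-piece:

```python
# Sketch of parallel context encoding: each sub-piece is encoded independently
# (no cross-piece attention), and the key/value states are concatenated so later
# queries can attend over the full context. `encode_fn` is hypothetical.
import torch
from typing import Callable, List, Tuple

def parallel_context_encode(
    chunks: List[torch.Tensor],                       # each (B, chunk_len, D)
    encode_fn: Callable[[torch.Tensor], Tuple[torch.Tensor, torch.Tensor]],
) -> Tuple[torch.Tensor, torch.Tensor]:
    kv = [encode_fn(c) for c in chunks]               # full attention only within a chunk
    keys = torch.cat([k for k, _ in kv], dim=1)       # (B, total_len, D)
    values = torch.cat([v for _, v in kv], dim=1)
    return keys, values
```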
arXiv Detail & Related papers (2024-12-21T09:04:51Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
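A naive sketch of per-query top-k key/value selection follows; the scoring here is a plain dot product rather than SPARSEK's learned scoring network and differentiable mask, and the full score matrix is materialized, which the paper's method avoids.

```python
# Naive per-query top-k sparse attention: each query keeps only its k_keep
# highest-scoring keys. Illustration only; not SPARSEK's efficient formulation.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep: int = 64):
    """q, k, v: (B, H, S, D)."""
    B, H, S, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5                 # (B, H, S, S)
    top_scores, top_idx = scores.topk(min(k_keep, S), dim=-1)   # (B, H, S, k)
    probs = F.softmax(top_scores, dim=-1)
    v_exp = v.unsqueeze(2).expand(B, H, S, S, D)                # broadcast values per query
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, D)
    v_sel = v_exp.gather(3, idx)                                # (B, H, S, k, D)
    return (probs.unsqueeze(-1) * v_sel).sum(dim=3)             # (B, H, S, D)
```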
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - An end-to-end attention-based approach for learning on graphs [8.552020965470113]
Transformer-based architectures for learning on graphs are motivated by attention as an effective learning mechanism.
We propose a purely attention-based approach consisting of an encoder and an attention pooling mechanism.
Despite its simplicity, the approach outperforms fine-tuned message passing baselines and recently proposed transformer-based methods on more than 70 node and graph-level tasks.
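A generic sketch of the attention-pooling readout idea (illustrative only, not the paper's exact module):

```python
# Generic attention pooling over node embeddings: a learned query attends over all
# node states to produce one graph-level vector.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, node_states: torch.Tensor) -> torch.Tensor:
        # node_states: (N, dim) for a single graph
        scores = self.key(node_states) @ self.query / node_states.shape[-1] ** 0.5  # (N,)
        weights = scores.softmax(dim=0)
        return weights @ self.value(node_states)       # (dim,) graph embedding
```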
arXiv Detail & Related papers (2024-02-16T16:20:11Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
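For reference, generic kernel-based linear attention looks like the sketch below; the elu+1 feature map is the standard choice from earlier linear-attention work, not the paper's focused feature map.

```python
# Generic (kernel feature map) linear attention: associativity turns O(S^2 * D)
# softmax attention into O(S * D^2).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (B, H, S, D)."""
    phi_q = F.elu(q) + 1                               # non-negative features
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bhsd,bhse->bhde", phi_k, v)     # (B, H, D, D) running state
    z = phi_k.sum(dim=2)                               # (B, H, D) normalizer state
    num = torch.einsum("bhsd,bhde->bhse", phi_q, kv)
    den = torch.einsum("bhsd,bhd->bhs", phi_q, z).clamp_min(eps)
    return num / den.unsqueeze(-1)
```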
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Faster Causal Attention Over Large Sequences Through Sparse Flash
Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
arXiv Detail & Related papers (2023-06-01T21:33:59Z) - Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z) - Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitates the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
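A sketch of the plain positive-random-feature approximation that this family of methods builds on; the paper's hybrid random features refine this basic construction.

```python
# Plain positive random features approximating softmax attention in linear time.
import torch

def random_feature_attention(q, k, v, num_features: int = 256):
    """q, k, v: (B, H, S, D); approximates softmax(q k^T / sqrt(D)) v."""
    d = q.shape[-1]
    q, k = q / d ** 0.25, k / d ** 0.25                        # fold in 1/sqrt(d) scaling
    w = torch.randn(num_features, d, device=q.device, dtype=q.dtype)

    def phi(x):                                                # unbiased for exp(q . k)
        return torch.exp(x @ w.T - x.pow(2).sum(-1, keepdim=True) / 2)

    phi_q, phi_k = phi(q), phi(k)                              # (B, H, S, m)
    kv = torch.einsum("bhsm,bhsd->bhmd", phi_k, v)
    num = torch.einsum("bhsm,bhmd->bhsd", phi_q, kv)
    den = torch.einsum("bhsm,bhm->bhs", phi_q, phi_k.sum(dim=2)).clamp_min(1e-6)
    return num / den.unsqueeze(-1)
```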
arXiv Detail & Related papers (2021-02-08T17:00:26Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Gaussian Constrained Attention Network for Scene Text Recognition [16.485898019983797]
We argue that the existing attention mechanism faces the problem of attention diffusion, in which the model may not focus on a certain character area.
We propose a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module.
In this way, the attention weights are more concentrated, and the attention-based recognition network achieves better performance.
arXiv Detail & Related papers (2020-10-19T01:55:30Z) - Learning fine-grained search space pruning and heuristics for
combinatorial optimization [5.72274610208488]
We propose a framework for leveraging machine learning techniques to scale up exact optimization algorithms.
Our framework learns the relatively simpler task of pruning the elements in order to reduce the size of the problem instances.
We show that our framework can prune a large fraction of the input graph and still detect almost all of the maximum cliques.
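A sketch of the prune-then-solve pattern for maximum clique, where the scorer is a hypothetical stand-in for the learned pruning model:

```python
# Prune-then-solve sketch: a learned scorer (hypothetical stand-in) ranks vertices,
# low-scoring vertices are dropped, and an exact search runs on the smaller graph.
import networkx as nx

def prune_then_solve(G: nx.Graph, score, keep_ratio: float = 0.2):
    """score(v) -> float: learned likelihood that v belongs to a maximum clique."""
    ranked = sorted(G.nodes, key=score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    H = G.subgraph(kept)
    # Exact enumeration is exponential in the worst case, but runs on a small instance.
    return max(nx.find_cliques(H), key=len)
```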
arXiv Detail & Related papers (2020-01-05T13:10:39Z)