Related papers: Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

URL: http://arxiv.org/abs/2511.02043v3
Date: Fri, 07 Nov 2025 17:26:02 GMT
Title: Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Authors: Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali,
Abstract summary: We introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs.<n>We show that Flashlight produces kernels with competitive or superior performance to FlexAttention.
Score: 2.9955129797385482
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

Related papers

FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference [42.19497037894398]
FlashFormer is a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models.<n>We observe nontrivial speedups compared to existing state-of-the-art inference kernels.
arXiv Detail & Related papers (2025-05-28T18:19:30Z)
FlashBias: Fast Computation of Attention with Bias [70.44379606190569]
Attention with bias has been widely deployed in vision, language, protein-folding and other advanced scientific models.<n>It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention.<n>This paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases.
arXiv Detail & Related papers (2025-05-17T15:12:50Z)
TileLang: A Composable Tiled Programming Model for AI Systems [17.240134151647187]
We present TileLang, a generalized tiled programming model for more efficient AI programming.<n> TileLang decouples scheduling space (thread binding, layout, tensorize and pipeline) from dataflow, and encapsulated them as a set of customization annotations and primitives.<n>We conduct comprehensive experiments on commonly-used devices, across numerous experiments, our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels.
arXiv Detail & Related papers (2025-04-24T14:08:49Z)
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [9.386969461835433]
FlashInfer is a customizable and efficient attention engine for large language models (LLMs)<n>It tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy.<n>It also offers a customizable attention template, enabling adaptation to various settings through Just-In-TimeJIT compilation.
arXiv Detail & Related papers (2025-01-02T02:02:20Z)
Flex Attention: A Programming Model for Generating Optimized Attention Kernels [5.489362130813523]
We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of PyTorch code.<n>We demonstrate how FlexAttention allows for easy composition of attention variants, solving the explosion of attention variants.
arXiv Detail & Related papers (2024-12-07T01:46:38Z)
S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context.<n>It remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models.<n>This paper presents Sparsely-Sharded(S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns. We increase the training speed of a transformer language model by $2.0times$ and $3.3times$ for sequences of respectively $8k$ and $16k$ tokens.
arXiv Detail & Related papers (2023-06-01T21:33:59Z)
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Our multi-scale linear attention achieves the global receptive field and multi-scale learning. EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
Toward Fast, Flexible, and Robust Low-Light Image Enhancement [87.27326390675155]
We develop a new Self-Calibrated Illumination (SCI) learning framework for fast, flexible, and robust brightening images in real-world low-light scenarios. Considering the computational burden of the cascaded pattern, we construct the self-calibrated module which realizes the convergence between results of each stage. We make comprehensive explorations to SCI's inherent properties including operation-insensitive adaptability and model-irrelevant generality.
arXiv Detail & Related papers (2022-04-21T14:40:32Z)
Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize. Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.