Related papers: FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

URL: http://arxiv.org/abs/2603.05451v1
Date: Thu, 05 Mar 2026 18:24:49 GMT
Title: FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao,
Abstract summary: FlashAttention-4 achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16. We implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches.
Score: 20.849774181776414
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
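The conditional softmax rescaling mentioned in the abstract can be illustrated with a small sketch. This is not the paper's kernel: it is a plain-Python, single-query online softmax in which the accumulators are rescaled only when a block's maximum score exceeds the stored reference maximum by more than a margin. The function names and the `threshold` parameter are hypothetical, chosen only for this illustration. The result stays exact because softmax is invariant to the reference value subtracted inside the exponential, as long as the same reference is applied to numerator and denominator.

```python
import math

def attention_row_reference(q, keys, values):
    """Plain softmax attention for one query, computed in a single pass."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(d_v)]

def attention_row_blocked(q, keys, values, block=2, threshold=0.0):
    """Blockwise (online) softmax attention for one query.

    `threshold` is a hypothetical knob for this sketch: accumulators are
    rescaled only when the block maximum exceeds the reference maximum by
    more than `threshold`. threshold=0.0 rescales on every new maximum.
    Returns the output row and the number of rescaling steps performed.
    """
    d_v = len(values[0])
    m_ref = -math.inf      # reference maximum subtracted inside exp()
    denom = 0.0            # running sum of exp(score - m_ref)
    acc = [0.0] * d_v      # running sum of exp(score - m_ref) * value
    rescales = 0
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_blk = max(scores)
        if m_blk > m_ref + threshold:
            # Reference is too stale: rescale accumulators to the new maximum.
            scale = math.exp(m_ref - m_blk)  # exp(-inf) == 0.0 on the first block
            denom *= scale
            acc = [a * scale for a in acc]
            m_ref = m_blk
            rescales += 1
        # If no rescale happened, exp(score - m_ref) is at most exp(threshold).
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_ref)
            denom += p
            acc = [a + p * x for a, x in zip(acc, v)]
    return [a / denom for a in acc], rescales

q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [1.0, 0.2], [0.9, 1.0], [-0.3, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [0.5, 0.5], [4.0, -1.0]]

ref = attention_row_reference(q, keys, values)
eager, eager_rescales = attention_row_blocked(q, keys, values, threshold=0.0)
lazy, lazy_rescales = attention_row_blocked(q, keys, values, threshold=8.0)
print(ref, eager, lazy, eager_rescales, lazy_rescales)
```

With a positive margin the loop skips most rescaling multiplications, at the cost of exponentials that can grow as large as `exp(threshold)`, so in a real kernel the margin would have to be chosen with the accumulator's numeric range in mind.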

Related papers

Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs [11.45717904490388]
Recent advances in transformer-based foundation models have made them the default choice for many tasks. Their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices.
arXiv Detail & Related papers (2025-12-24T00:41:13Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness [0.0]
Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers. This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy. We develop a methodology for representing intermediate-level pseudocode with diagrams, allowing hardware-aware algorithms to be derived step-by-step.
arXiv Detail & Related papers (2024-12-04T13:52:04Z)
ThunderKittens: Simple, Fast, and Adorable AI Kernels [43.32681787348603]
We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations.
arXiv Detail & Related papers (2024-10-27T10:07:16Z)
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks. We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications. We develop three main techniques to speed up attention on GPUs. We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0$\times$ speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z)
AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU. We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training. It achieves 1.67x and 1.26-1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly efficient large generative model inference on high-performance Tensor Cores. At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
We exploit the asymmetric GPU memory hierarchy to bring significant memory saving and runtime speedup. FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers. It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.