Block Sparse Flash Attention
- URL: http://arxiv.org/abs/2512.07011v1
- Date: Sun, 07 Dec 2025 21:20:12 GMT
- Title: Block Sparse Flash Attention
- Authors: Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry, Noam Elata
- Abstract summary: Block-Sparse FlashAttention is a drop-in replacement for FlashAttention. It computes exact query-key similarities to select the top-k most important value blocks for each query. It achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x on needle-in-a-haystack retrieval tasks.
- Score: 29.499030734003952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention
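The abstract's core mechanism (compute exact per-block query-key scores, compare each block's maximum score against a calibrated threshold, and skip pruned blocks entirely) can be illustrated with a minimal NumPy sketch. This is not the paper's CUDA kernel: the function name, the single scalar threshold `tau` (standing in for the per-layer, per-head thresholds the paper calibrates offline), and the fallback when every block is pruned are all illustrative assumptions.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, tau=0.0):
    """Illustrative sketch of threshold-based block pruning for one query.

    Computes exact q.K scores for each key block, keeps a block only if
    its maximum score reaches the calibrated threshold `tau`, then runs
    softmax over the surviving scores only. With tau = -inf this reduces
    to ordinary full attention.
    """
    n, d = K.shape
    scale = 1.0 / np.sqrt(d)
    kept_scores, kept_vals = [], []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        s = (q @ K[start:end].T) * scale   # exact query-key scores for this block
        if s.max() >= tau:                 # per-block max vs calibrated threshold
            kept_scores.append(s)
            kept_vals.append(V[start:end])
    if not kept_scores:
        # Degenerate fallback (an assumption, not from the paper):
        # if every block is pruned, keep the most recent block.
        s = (q @ K[-block_size:].T) * scale
        kept_scores, kept_vals = [s], [V[-block_size:]]
    s = np.concatenate(kept_scores)
    w = np.exp(s - s.max())                # numerically stable softmax
    w /= w.sum()
    return w @ np.vstack(kept_vals)        # weighted sum over surviving value blocks
```

Because softmax is computed only over the surviving scores, pruned blocks contribute neither score computation nor value reads, which is where the claimed memory-transfer savings come from in the real kernel.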
Related papers
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting [59.31340724915079]
Event Spotting is a key task for applications in sports analytics, robotics, and autonomous systems. AdaSpot achieves state-of-the-art performance under strict evaluation metrics.
arXiv Detail & Related papers (2026-02-25T16:24:48Z) - FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z) - Optimizing Mixture of Block Attention [12.276306440688137]
We develop a statistical model to analyze MoBA's underlying mechanics. We identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals. We introduce FlashMoBA, a hardware-aware kernel that enables efficient MoBA execution even with the small block sizes our theory recommends.
arXiv Detail & Related papers (2025-11-14T18:59:59Z) - Efficient Low Rank Attention for Long-Context Inference in Large Language Models [41.24530756499533]
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-r factors during the prefill stage. By selecting only the top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - Sparser Block-Sparse Attention via Token Permutation [46.22204775916057]
We propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to 2.75x in long-context prefilling.
arXiv Detail & Related papers (2025-10-24T09:11:50Z) - ProxyAttn: Guided Sparse Attention via Representative Heads [59.03412871683236]
We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation. We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
arXiv Detail & Related papers (2025-09-29T13:10:39Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
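The centroid-based key prediction described above can be sketched in a few lines of NumPy: score the query against cluster centroids of the fixed-context keys, then keep only keys belonging to the top-scoring clusters. The function name, the `top_c` parameter, and the use of precomputed cluster labels are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def centroid_filtered_keys(q, K, labels, centroids, top_c=2):
    """Hypothetical sketch of centroid-based key selection.

    q         : (d,) query vector
    K         : (n, d) fixed-context keys
    labels    : (n,) cluster label of each key (e.g. from offline k-means)
    centroids : (c, d) cluster centroids
    top_c     : number of top-scoring clusters to keep

    Returns the keys predicted to be relevant and the boolean mask used.
    """
    scores = centroids @ q                 # score the query against each centroid
    keep = np.argsort(scores)[-top_c:]     # indices of the top-scoring clusters
    mask = np.isin(labels, keep)           # keys whose cluster survived
    return K[mask], mask
```

Attention is then computed only over the returned keys, so the cost scales with the number of centroids plus the surviving keys rather than the full fixed-context length.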
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs [8.649971923487835]
We propose CritiPrefill, a criticality-based segment-wise prefilling method for long-context processing.
CritiPrefill partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality.
Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU.
arXiv Detail & Related papers (2024-09-19T06:09:56Z) - RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach to both accelerate attention computation and reduce GPU memory consumption. We show that RetrievalAttention achieves near full attention accuracy while only requiring access to 1--3% of the data.
arXiv Detail & Related papers (2024-09-16T17:59:52Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade between obtaining high accuracy while maintaining load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.