FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
- URL: http://arxiv.org/abs/2508.18224v2
- Date: Mon, 13 Oct 2025 08:46:58 GMT
- Title: FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
- Authors: Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, Binhang Yuan
- Abstract summary: Flash Sparse Attention (FSA) is an alternative kernel implementation that enables efficient NSA across a wide range of popular language models. Compared to the vanilla NSA kernel implementation, FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction.
- Score: 38.72781754531673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance boosts while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads per GQA group; this inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with varied, smaller numbers of query heads per GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. GitHub repo: https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
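To make the head-count issue concrete, below is a minimal, non-optimized PyTorch reference for NSA-style selected-block attention under GQA. All names, shapes, and the block selection here are illustrative assumptions, not the FSA kernel or its API: the point is that every query head in a GQA group shares one KV head and one set of selected KV blocks, so a kernel whose tiles span only the group's query heads stays underutilized when the group is small, which is the inefficiency FSA's alternative loop order targets.

```python
import torch

def nsa_selected_attention(q, k, v, sel, block):
    """
    q:    [Hq, D]  query heads of one GQA group at one query position
    k, v: [T, D]   the group's single shared KV head
    sel:  [S]      indices of the S selected key/value blocks
    block: int     KV block size
    """
    # Gather the selected KV blocks into one contiguous [S*block, D] slab.
    idx = (sel[:, None] * block + torch.arange(block)).reshape(-1)
    ks, vs = k[idx], v[idx]
    # Softmax attention restricted to the selected blocks; all Hq heads
    # of the group read the same ks/vs, so small Hq means small tiles.
    scores = (q @ ks.T) / q.shape[-1] ** 0.5    # [Hq, S*block]
    return torch.softmax(scores, dim=-1) @ vs   # [Hq, D]

# Example: a GQA group with only 4 query heads, as in many open LLMs.
q = torch.randn(4, 64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
out = nsa_selected_attention(q, k, v, sel=torch.tensor([0, 3, 7]), block=64)
```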
Related papers
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Inference of large language models (LLMs) at the edge can improve service responsiveness while protecting user privacy. We propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models [6.977949425464]
We leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. Our approach, RADSeg, achieves a 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters.
arXiv Detail & Related papers (2025-11-24T21:15:01Z) - LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication [5.468224958799568]
We present a detailed performance study of multi-node distributed inference using large language models (LLMs) on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation.
arXiv Detail & Related papers (2025-11-12T18:59:26Z) - dInfer: An Efficient Inference Framework for Diffusion Language Models [54.80918957287927]
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs. We present dInfer, an efficient framework for dLLM inference.
arXiv Detail & Related papers (2025-10-09T16:19:42Z) - ProxyAttn: Guided Sparse Attention via Representative Heads [59.03412871683236]
We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation. We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
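As a rough illustration of the idea summarized above (the single representative head, max-pooling, and top-k selection are stand-ins of mine, not necessarily the paper's choices):

```python
import torch

def proxy_block_topk(q_rep, k_rep, block, keep):
    """q_rep, k_rep: [T, D] queries/keys of one representative head."""
    scores = q_rep @ k_rep.T                                 # [T, T] cheap proxy scores
    T = scores.shape[0]
    pooled = scores.reshape(T, T // block, block).amax(-1)   # max-pool per KV block
    return pooled.topk(keep, dim=-1).indices                 # [T, keep] block ids per query

# Full attention then runs only on the estimated blocks, for all heads.
blocks = proxy_block_topk(torch.randn(256, 64), torch.randn(256, 64), block=32, keep=2)
```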
arXiv Detail & Related papers (2025-09-29T13:10:39Z) - AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications [8.964981700274059]
We propose AccelGen, a high-throughput inference serving system with heterogeneous SLO guarantees for diverse applications. Real-trace experiments demonstrate that AccelGen achieves 1.42-11.21x higher throughput, 1.43-13.71x higher goodput, 37-90% higher SLO attainment, and 1.61-12.22x lower response latency compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-17T21:47:43Z) - Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity. LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning. LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference burden.
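A minimal sketch of the simplest reading of this mechanism: the LoRA update is masked by the base weight's existing sparsity pattern so the merged weight stays sparse. Illustrative only, not the paper's exact rule:

```python
import torch

def merge_lora_sparse(W, A, B, scale=1.0):
    """W: [out, in] pruned weight (zeros = pruned); A: [out, r]; B: [r, in]."""
    mask = (W != 0).to(W.dtype)          # reuse the base weight's sparsity pattern
    return W + scale * (A @ B) * mask    # masked LoRA merge keeps W sparse

W = torch.randn(16, 32) * (torch.rand(16, 32) > 0.5)   # toy 50%-pruned weight
merged = merge_lora_sparse(W, torch.randn(16, 4), torch.randn(4, 32))
```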
arXiv Detail & Related papers (2025-02-20T18:37:32Z) - SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs [5.760049762453579]
Accelerating large language models with CPUs enables broader AI access at lower cost and power consumption. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model. We demonstrate for the first time the use of unstructured sparsity in attention, achieving a $1.14\times$ speedup over current systems.
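A tiny sketch of what unstructured sparse kernels for PyTorch models amount to, using stock PyTorch sparse tensors as a stand-in for the paper's custom AMX kernels (the pruning threshold here is arbitrary):

```python
import torch

W = torch.randn(1024, 1024)
W = W * (W.abs() > 0.8)       # unstructured magnitude pruning (illustrative)
Ws = W.to_sparse()            # sparse layout, as sparse kernels consume
x = torch.randn(1024, 8)
y = torch.sparse.mm(Ws, x)    # sparse-dense matmul skips the zeroed weights
```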
arXiv Detail & Related papers (2025-02-18T02:26:34Z) - Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation. Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match the non-uniform patterns in their weight distributions.
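A hedged sketch of groupwise codebook quantization as described above: each group of weights is mapped to the nearest entry of a small per-group codebook, so only indices plus the codebook are stored. The quantile initialization and single k-means-style refinement step are assumptions of mine, not the paper's fitting procedure:

```python
import torch

def codebook_quantize(w, group=64, bits=2):
    """Dequantized view of w after per-group codebook quantization."""
    out = []
    for g in w.reshape(-1, group):                       # quantize each group alone
        # Initialize 2**bits centroids at quantiles, then one refinement step.
        c = torch.quantile(g, torch.linspace(0, 1, 2 ** bits))
        assign = (g[:, None] - c[None, :]).abs().argmin(-1)
        for j in range(2 ** bits):                       # update centroids
            if (assign == j).any():
                c[j] = g[assign == j].mean()
        out.append(c[(g[:, None] - c[None, :]).abs().argmin(-1)])
    return torch.cat(out).reshape(w.shape)

w_hat = codebook_quantize(torch.randn(4, 64))
```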
arXiv Detail & Related papers (2024-12-23T03:44:29Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
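A minimal sketch of the query-to-centroid test described above; the offline clustering of the fixed keys is stubbed out and the top-cluster cutoff is an assumed parameter:

```python
import torch

def fixed_context_mask(q, centroids, labels, topc=4):
    """q: [D] query; centroids: [C, D]; labels: [T] cluster id per fixed key."""
    top = (centroids @ q).topk(topc).indices   # most query-relevant clusters
    return torch.isin(labels, top)             # boolean mask over the T fixed keys

C, T, D = 16, 512, 64
centroids = torch.randn(C, D)                  # from an offline clustering (stubbed)
labels = torch.randint(0, C, (T,))             # cluster membership of each fixed key
mask = fixed_context_mask(torch.randn(D), centroids, labels)
```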
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
We introduce a novel approach to accelerate attention computation in Large Language Models (LLMs). We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention. Our method only introduces provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Milliontokens Inference) is a sparse calculation method designed to accelerate pre-filling of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
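For intuition, here is the textbook binarization baseline that such schemes build on: weights become sign(W) times a per-row scale, with alpha = mean(|W|) minimizing the L2 error. BiLLM's actual method (salient-weight handling, residual binarization) is more elaborate; this sketch shows only the core 1-bit idea:

```python
import torch

def binarize(W):
    """Baseline 1-bit quantization: per-row scale times the sign pattern."""
    alpha = W.abs().mean(dim=1, keepdim=True)   # L2-optimal per-row scale
    return alpha * W.sign()                     # 1-bit weights + fp scales

W = torch.randn(8, 16)
err = (W - binarize(W)).pow(2).mean()           # reconstruction error
```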
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
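A toy sketch of throughput-driven assignment in this spirit (not the paper's protocol): end-to-end throughput is bottlenecked by the weakest pipeline stage, so each device is greedily placed on the currently weakest stage:

```python
def assign_devices(device_speeds, num_stages):
    """Greedy placement: device_speeds maps device name -> tokens/sec."""
    load = [0.0] * num_stages
    plan = {}
    for dev, speed in sorted(device_speeds.items(), key=lambda kv: -kv[1]):
        stage = min(range(num_stages), key=load.__getitem__)  # weakest stage
        load[stage] += speed
        plan[dev] = stage
    return plan, min(load)   # assignment and bottleneck throughput

plan, tput = assign_devices({"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}, num_stages=2)
```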
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - A New Algorithm for Tessellated Kernel Learning [4.264192013842097]
An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); and be universal (for accuracy).
The recently proposed Tessellated Kernels (TKs) is currently the only known class which meets all three criteria.
By contrast, the 2-step algorithm proposed here scales to 10,000 data points and extends to the regression problem.
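For reference, the "linear parameterization" property above is the familiar one sketched here: a nonnegative combination of fixed positive semidefinite base kernels is itself a valid kernel and is linear in its weights. This generic sketch uses Gaussian base kernels as an assumed choice and is not the TK construction itself:

```python
import torch

def combined_gram(X, weights, gammas):
    """Gram matrix of K_w(x, x') = sum_i w_i * exp(-gamma_i * ||x - x'||^2)."""
    d2 = torch.cdist(X, X).pow(2)                    # pairwise squared distances
    return sum(w * torch.exp(-g * d2) for w, g in zip(weights, gammas))

K = combined_gram(torch.randn(100, 5), weights=[0.5, 0.5], gammas=[0.1, 1.0])
```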
arXiv Detail & Related papers (2020-06-13T18:33:31Z)