VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
- URL: http://arxiv.org/abs/2603.04460v1
- Date: Tue, 03 Mar 2026 09:24:58 GMT
- Title: VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
- Authors: Chen Guanzhong
- Abstract summary: Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench and RULER benchmarks, VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k. These results establish a new Pareto frontier in the trade-off between accuracy and efficiency.
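The vertical-slash idea described above (score whole columns and diagonals of the attention map, then keep just enough of each to pass a cumulative-score threshold) can be sketched in a few lines of NumPy. Everything here is illustrative: the function names and toy scores are assumptions, not the paper's VSIndexer, which learns these importance scores from RoPE-augmented key-value representations.

```python
import numpy as np

def vertical_slash_mask(scores_v, scores_s, n, threshold=0.95):
    """Build a boolean [n, n] causal mask from per-column ("vertical") and
    per-diagonal ("slash") importance scores: keep the highest-scoring
    lines until their cumulative share of the total score passes
    `threshold` (the cumulative-threshold budget idea)."""
    def select(scores):
        order = np.argsort(scores)[::-1]                  # most important first
        share = np.cumsum(scores[order]) / scores.sum()
        return order[:np.searchsorted(share, threshold) + 1]

    mask = np.zeros((n, n), dtype=bool)
    mask[:, select(scores_v)] = True                      # keep vertical columns
    for d in select(scores_s):                            # keep slash diagonals, offset 0..n-1
        idx = np.arange(d, n)
        mask[idx, idx - d] = True
    return np.tril(mask)                                  # causal: drop upper triangle

# toy 8-token example: a strong "sink" column plus a local diagonal window
n = 8
scores_v = np.array([5.0, .1, .1, .1, .1, .1, .1, .1])    # column 0 dominates
scores_s = np.array([6.0, 2.0, .2, .1, .1, .1, .1, .1])   # diagonals 0 and 1 dominate
mask = vertical_slash_mask(scores_v, scores_s, n)
print(mask.sum(), "of", n * (n + 1) // 2, "causal entries kept")
```

The two line families are complementary: vertical columns capture attention sinks and globally important tokens, while slash diagonals capture the local sliding-window structure, which is why their union covers most of the attention mass at a small fraction of the quadratic cost.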
Related papers
- Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends only to the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8× speedup and reduces the KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs [17.499497967319332]
We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy while reducing prefill latency by 20-60% and peak memory usage by 35%.
arXiv Detail & Related papers (2025-10-28T16:34:18Z)
- Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose a training-free framework that exploits the heterogeneous roles of transformer head dimensions. By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
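The Fourier-approximation idea can be illustrated with a one-dimensional sketch: a cache dimension that evolves smoothly over the sequence axis is stored as a fixed number of low-frequency coefficients. This is a toy stand-in under assumed names, not the paper's head-dimension selection or its actual compression pipeline.

```python
import numpy as np

def spectral_compress(x, k):
    """Approximate a length-n signal (one cache dimension over the
    sequence axis) by keeping only its k lowest-frequency rFFT
    coefficients: k complex numbers stand in for n floats."""
    coef = np.fft.rfft(x)
    coef[k:] = 0                       # discard high-frequency content
    return np.fft.irfft(coef, n=len(x))

# a smooth, "long-context-insensitive" dimension compresses almost losslessly
n = 256
t = np.arange(n) / n
x = np.sin(2 * np.pi * t) + 0.3 * np.cos(4 * np.pi * t)
err = np.abs(x - spectral_compress(x, 8)).max()
print(f"max reconstruction error with 8 of {n // 2 + 1} coefficients: {err:.1e}")
```

The payoff is that the stored footprint is fixed-length regardless of context length; the method's selectivity matters because dimensions that are *not* smooth over time would lose information under the same truncation.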
arXiv Detail & Related papers (2025-06-13T15:35:54Z)
- SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling [24.241825495462397]
Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. We propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
arXiv Detail & Related papers (2025-05-30T03:40:24Z)
- AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity [9.63873831179673]
Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase. We propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time.
arXiv Detail & Related papers (2025-05-29T14:59:06Z)
- ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs). In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck. We achieve a 1.76x improvement in chunk throughput, yielding a 23.50x acceleration in the prefill stage with negligible performance loss.
arXiv Detail & Related papers (2025-02-20T07:10:43Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with centroids of the offline-clustered fixed-context keys to predict which keys are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
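The centroid lookup can be sketched as offline k-means over the fixed context's keys plus an online nearest-centroid routing step. The clustering setup, names, and toy data below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(keys, k, iters=10):
    """Plain k-means over the cached keys (the offline step on the fixed
    context); deterministically initialized from evenly spaced keys."""
    cent = keys[np.linspace(0, len(keys) - 1, k, dtype=int)].astype(float)
    for _ in range(iters):
        assign = ((keys[:, None, :] - cent[None, :, :]) ** 2).sum(-1).argmin(1)
        cent = np.stack([keys[assign == j].mean(0) if (assign == j).any() else cent[j]
                         for j in range(k)])
    return cent, assign

# stand-in fixed context: 512 cached keys forming two well-separated clusters
keys = np.concatenate([rng.normal(3.0, 0.5, (256, 8)),
                       rng.normal(-3.0, 0.5, (256, 8))])
cent, assign = kmeans(keys, k=2)

# inference: route the query to its nearest centroid, then attend only
# over the keys assigned to that cluster
query = rng.normal(3.0, 0.5, 8)
best = int(((cent - query) ** 2).sum(-1).argmin())
selected = np.flatnonzero(assign == best)
print(f"attending over {len(selected)} of {len(keys)} cached keys")
```

Because the clustering is amortized over the fixed context, the per-query cost is only a comparison against the centroids, which is what the hierarchical variant then reduces from linear to logarithmic.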
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
arXiv Detail & Related papers (2024-06-17T11:05:15Z)
- HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, by employing Locality Sensitive Hashing (LSH) to identify large entries in the attention matrix, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
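The LSH step can be illustrated with sign-based hashing (SimHash) over random hyperplanes: queries and keys that point in similar directions land in the same bucket, so large attention entries can be found without forming the full matrix. This toy sketch shows only the bucketing primitive, not HyperAttention's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def simhash_buckets(x, planes):
    """Sign-based LSH (SimHash): each row of x is hashed to a small
    integer bucket id from the signs of its random-hyperplane
    projections; similarly oriented vectors tend to share a bucket."""
    bits = (x @ planes.T > 0).astype(int)
    return bits @ (1 << np.arange(planes.shape[0]))

d, n_planes = 16, 4
planes = rng.normal(size=(n_planes, d))
q = rng.normal(size=d)

# a positively rescaled copy of q hashes identically; its negation never can,
# since negation flips every sign bit
vecs = np.stack([q, 2 * q, -q])
b = simhash_buckets(vecs, planes)
print(b)
```

Only query-key pairs that collide in a bucket are scored exactly, which is how the attention cost drops to near-linear in the sequence length.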
arXiv Detail & Related papers (2023-10-09T17:05:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.