SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
- URL: http://arxiv.org/abs/2406.15486v2
- Date: Fri, 28 Jun 2024 08:55:17 GMT
- Title: SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
- Authors: Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang
- Abstract summary: Large language models (LLMs) now support extremely long context windows.
The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.
We propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism.
- Score: 47.5772915135952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism. Leveraging the significant sparse patterns we observe, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimal set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
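To make the two sparse patterns concrete, here is a minimal NumPy sketch of how a single head's sparse mask could combine a local window with query-guided column stripes. The sampling heuristic and the parameter names (`local_frac`, `sample_frac`, `coverage`) are illustrative assumptions, not the paper's exact algorithm or kernel.

```python
import numpy as np

def sample_attention_mask(q, k, local_frac=0.05, sample_frac=0.02, coverage=0.95):
    """Illustrative sparse-mask construction (not the paper's exact algorithm).

    q, k: (seq_len, head_dim) query/key matrices for one attention head.
    Returns a boolean (seq_len, seq_len) causal mask keeping
    (1) a local window of adjacent tokens and
    (2) "column stripe" keys selected via attention scores of sampled queries.
    """
    n, d = q.shape
    win = max(1, int(local_frac * n))

    # Stage 1: score a small random sample of queries against all keys.
    idx = np.random.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    scores = q[idx] @ k.T / np.sqrt(d)                      # (m, n)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Stage 2: keep the smallest key set whose pooled mass reaches `coverage`.
    pooled = probs.mean(axis=0)                             # (n,)
    order = np.argsort(pooled)[::-1]
    n_keep = int(np.searchsorted(np.cumsum(pooled[order]), coverage)) + 1
    keep = order[:n_keep]

    mask = np.zeros((n, n), dtype=bool)
    mask[:, keep] = True                                    # column stripes
    for i in range(n):                                      # local window
        mask[i, max(0, i - win): i + 1] = True
    return mask & np.tril(np.ones((n, n), dtype=bool))      # causal
```

In practice the selected keys would feed a fused FlashAttention-style kernel rather than a dense $n \times n$ boolean mask; the mask is materialized here purely for clarity.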
Related papers
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high inference latency due to the substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
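As a toy illustration of the high-compression regime this paper studies, the sketch below attention-pools a grid of visual tokens down to a single summary token. The random stand-in queries and the pooling scheme are assumptions for illustration, not the paper's method.

```python
import numpy as np

def compress_visual_tokens(tokens, n_out=1):
    """Toy token compression: attention-pool `tokens` (n, d) down to `n_out`
    summary tokens using stand-in query vectors (random here; learned in a
    real system). Illustrative only, not the paper's compression method."""
    n, d = tokens.shape
    queries = np.random.randn(n_out, d) / np.sqrt(d)
    scores = queries @ tokens.T / np.sqrt(d)           # (n_out, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over tokens
    return w @ tokens                                  # (n_out, d)
```

Mapping a few hundred patch tokens to one token shifts nearly all inference compute onto the language model, which is the token-count versus model-size trade-off the title refers to.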
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly.
Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items.
We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
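A toy sketch of the strict top-K verification idea referenced above, under the assumption that a drafted step is accepted only when the draft model's top-K item set equals the target model's; AtSpeed's actual objectives and verification relaxations are more involved.

```python
import numpy as np

def strict_topk_verify(draft_logits, target_logits, k=5):
    """Toy 'strict top-K verification' for speculative recommendation:
    accept the drafted step only if the draft's top-K item set matches the
    target model's top-K set; otherwise fall back to the target model."""
    draft_topk = set(np.argsort(draft_logits)[-k:])
    target_topk = set(np.argsort(target_logits)[-k:])
    return draft_topk == target_topk
```

The AtSpeed-S objective would then train the draft model so that this acceptance test passes as often as possible.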
arXiv Detail & Related papers (2024-10-07T16:23:36Z)
- Boosting MLPs with a Coarsening Strategy for Long-Term Time Series Forecasting [6.481470306093991]
Deep learning methods have demonstrated their strengths in long-term time series forecasting.
They often struggle to strike a balance between expressive power and computational efficiency.
Here, we propose a coarsening strategy that alleviates these problems by forming information granules in place of solitary temporal points.
Based purely on convolutions of structural simplicity, CP-Net is able to maintain a linear computational complexity and low runtime, while demonstrating an improvement of 4.1% compared with the SOTA method on seven forecasting benchmarks.
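A minimal sketch of the granule-forming step above, assuming non-overlapping groups of adjacent points summarized by simple statistics; the granule size and the summary functions are illustrative, not CP-Net's exact design.

```python
import numpy as np

def coarsen(series, granule=4):
    """Form 'information granules' by summarizing non-overlapping groups of
    adjacent points (here: mean and range per group). Returns an array of
    shape (len(series) // granule, 2)."""
    n = len(series) - len(series) % granule
    g = series[:n].reshape(-1, granule)
    return np.stack([g.mean(axis=1), g.max(axis=1) - g.min(axis=1)], axis=1)

# A stack of 1-D convolutions over these granules keeps the overall
# complexity linear in the series length.
```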
arXiv Detail & Related papers (2024-05-06T06:47:44Z)
- SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters [16.966008476215258]
This paper introduces SparseTSF, a novel, extremely lightweight model for Long-term Time Series Forecasting (LTSF).
At the heart of SparseTSF lies the Cross-Period Sparse Forecasting technique, which simplifies the forecasting task by decoupling the periodicity and trend in time series data.
SparseTSF showcases remarkable generalization capabilities, making it well-suited for scenarios with limited computational resources, small samples, or low-quality data.
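The cross-period idea can be sketched as follows: downsample the series by its dominant period into phase-aligned subsequences, forecast all of them with one shared linear map, then interleave the results. Lookback and horizon are assumed divisible by the period for simplicity; initialization and training are omitted.

```python
import numpy as np

class CrossPeriodSparseForecaster:
    """Sketch of cross-period sparse forecasting: a length-L series with
    period w becomes w phase subsequences of length L/w, each forecast by
    the same (L/w, H/w) linear map. Assumes L % w == 0 and H % w == 0."""
    def __init__(self, lookback, horizon, period):
        self.w = period
        self.W = np.random.randn(lookback // period, horizon // period) * 0.01

    def forecast(self, x):                    # x: (lookback,)
        sub = x.reshape(-1, self.w)           # column j = phase-j subsequence
        out = sub.T @ self.W                  # (w, H/w), weights shared
        return out.T.reshape(-1)              # interleave back to (horizon,)
```

Because the single weight matrix has roughly (L/w)*(H/w) entries instead of L*H, the period w shrinks the parameter count by about a factor of w squared, which is how a model on the order of 1k parameters becomes possible.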
arXiv Detail & Related papers (2024-05-02T02:15:23Z)
- CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning [59.88924847995279]
We propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for multivariate time series forecasting (MTSF).
To reduce the distribution discrepancy, we develop the cross-modal match module.
CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks.
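A heavily simplified sketch of what a cross-modal match step could look like: re-expressing each time-series patch embedding as a mixture of LLM word embeddings so the two distributions move closer. The softmax-retrieval form, the temperature `tau`, and the cosine similarity are assumptions; CALF's actual module differs in detail.

```python
import numpy as np

def cross_modal_match(ts_emb, word_emb, tau=0.1):
    """Toy cross-modal matching: map each time-series patch embedding
    (rows of ts_emb) to a softmax-weighted mixture of LLM word embeddings
    (rows of word_emb), reducing the distribution discrepancy."""
    ts_n = ts_emb / np.linalg.norm(ts_emb, axis=1, keepdims=True)
    wd_n = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sim = ts_n @ wd_n.T / tau
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ word_emb          # matched embeddings in the text space
```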
arXiv Detail & Related papers (2024-03-12T04:04:38Z)
- Explaining Time Series via Contrastive and Locally Sparse Perturbations [45.055327583283315]
ContraLSP is a sparse model that introduces counterfactual samples to build uninformative perturbations while keeping them in-distribution via contrastive learning.
Empirical studies on both synthetic and real-world datasets show that ContraLSP outperforms state-of-the-art models.
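The perturbation mechanism can be sketched as a learned mask blending the input with counterfactual values; the contrastive objective that keeps the counterfactuals in-distribution and the sparsity penalty on the mask are noted in comments but not implemented here.

```python
import numpy as np

def perturb(x, counterfactual, mask):
    """Mask-based perturbation as in saliency methods like ContraLSP:
    keep features where mask ~ 1, swap in counterfactual values where
    mask ~ 0. In the real method both the counterfactuals and the mask
    are learned; here they are fixed inputs."""
    return mask * x + (1.0 - mask) * counterfactual

# Training would minimize prediction change plus an L1 penalty on the mask,
# so only the truly informative time steps keep mask values near 1.
```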
arXiv Detail & Related papers (2024-01-16T18:27:37Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation (VSS) datasets have limited categories per video.
Fewer than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
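The matching bottleneck quoted above can be reproduced with a toy bipartite assignment between mask queries and ground-truth masks; the random cost matrix below is a stand-in for the real class and mask matching costs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_query_fraction(cost):
    """Hungarian matching between N mask queries (rows) and the few
    ground-truth masks in a clip (cols); only matched queries receive
    class/mask gradients during training."""
    rows, _ = linear_sum_assignment(cost)
    return len(rows) / cost.shape[0]

# With ~100 queries but fewer than 10 categories per video, under 10%
# of queries are ever matched:
print(matched_query_fraction(np.random.rand(100, 6)))   # -> 0.06
```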
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Hierarchical Adaptive Voxel-guided Sampling for Real-time Applications in Large-scale Point Clouds [6.094829692829813]
We propose a hierarchical adaptive voxel-guided point sampler with linear complexity and high parallelization for real-time applications.
Our method achieves performance competitive with the most powerful farthest point sampling (FPS), while running more than 100 times faster.
Our sampler can be easily integrated into existing models and achieves a 20$\sim$80% reduction in runtime with minimal effort.
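The linear-time core of voxel-guided sampling can be sketched in a few lines: bucket points into a uniform grid and keep one representative per occupied voxel. The hierarchy and adaptive voxel sizing that give the paper its accuracy are omitted; `voxel_size` is an illustrative parameter.

```python
import numpy as np

def voxel_sample(points, voxel_size=0.5):
    """Simplest voxel-guided downsampling: hash (n, 3) points into a uniform
    grid and keep the first point seen in each occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]
```

Unlike farthest point sampling, which repeatedly scans the whole cloud, this pass touches each point once, which is where speedups of this magnitude come from.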
arXiv Detail & Related papers (2023-05-23T17:45:49Z)
- Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
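Of the three components, the local-to-local (L2L) similarity metric is the easiest to sketch: compare every local descriptor of the query against every local descriptor of the support and aggregate the best matches. The cosine metric and the max-then-mean aggregation are assumptions; the BAS and FOA modules that would first suppress background and align the foreground are omitted.

```python
import numpy as np

def local_to_local_similarity(support_feats, query_feats):
    """Toy L2L metric over (n, d) local descriptors: match each query
    descriptor to its most similar support descriptor (cosine), then
    average the match scores into one image-level similarity."""
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return (q @ s.T).max(axis=1).mean()
```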
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.