SageAttention2++: A More Efficient Implementation of SageAttention2
- URL: http://arxiv.org/abs/2505.21136v3
- Date: Fri, 06 Jun 2025 07:47:22 GMT
- Title: SageAttention2++: A More Efficient Implementation of SageAttention2
- Authors: Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
- Abstract summary: We propose to utilize the faster instruction of FP8 Matmul accumulated in FP16 to accelerate SageAttention2. Experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2.
- Score: 21.70605866986346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.
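To make the core idea concrete, here is a minimal, numerics-only sketch of per-block FP8 (e4m3) quantization for the QK^T product. This is not the SageAttention2++ CUDA kernel: the block size, function names, and the float32 emulation of the FP16-accumulated FP8 Matmul instruction are assumptions for illustration only.

```python
# Numerics-only sketch (assumed names and block size; PyTorch >= 2.1 for float8 dtypes).
# A real SageAttention2++-style kernel would keep Q/K in FP8 and use the tensor-core
# FP8 MMA instruction that accumulates in FP16; here only the rounding is emulated.
import torch

def fake_fp8_per_block(x, block=128):
    """Round-trip x through float8_e4m3 per row-block; return values and per-block scales."""
    fp8_max = 448.0  # largest finite value representable in e4m3
    vals, scales = [], []
    for start in range(0, x.shape[0], block):
        blk = x[start:start + block]
        s = blk.abs().amax().clamp(min=1e-8) / fp8_max          # per-block scale
        vals.append((blk / s).to(torch.float8_e4m3fn).float())  # emulate FP8 storage
        scales.append(s)
    return torch.cat(vals), torch.stack(scales)

def attention_fp8_sim(q, k, v, block=128):
    qh, sq = fake_fp8_per_block(q, block)
    kh, sk = fake_fp8_per_block(k, block)
    scores = qh @ kh.T                                          # quantized QK^T Matmul
    scores *= sq.repeat_interleave(block)[: q.shape[0], None]   # undo row-block scales
    scores *= sk.repeat_interleave(block)[None, : k.shape[0]]   # undo column-block scales
    p = torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1)
    return p @ v  # the PV Matmul is kept in full precision in this simplified sketch

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(attention_fp8_sim(q, k, v).shape)  # torch.Size([256, 64])
```

The quantize-multiply-rescale structure is the part shared with SageAttention2; the 2x gain described in the abstract comes from the faster FP16-accumulated FP8 instruction itself, which a Python emulation cannot reproduce.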
Related papers
- SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity [52.88892280536302]
We introduce SparseLoRA, a method that accelerates fine-tuning through contextual sparsity. We show that SparseLoRA reduces computational cost by up to 2.2x and delivers a measured speedup of up to 1.6x.
arXiv Detail & Related papers (2025-06-19T17:53:34Z)
- SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization [22.551095978580147]
We propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. Our approach incurs negligible end-to-end metrics loss across diverse models, including those for language, image, and video generation.
arXiv Detail & Related papers (2024-11-17T04:35:49Z)
- An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks [8.779871128906787]
We propose algorithms to improve the inference time and memory efficiency of Deep Neural Networks (DNNs). We focus on matrix multiplication as the bottleneck operation of inference. Our experiments show up to a 5.24x speedup in inference time.
arXiv Detail & Related papers (2024-11-10T04:56:14Z)
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [22.551095978580147]
We propose SageAttention, a highly efficient and accurate quantization method for attention. Our approach incurs almost no end-to-end metrics loss across diverse models.
arXiv Detail & Related papers (2024-10-03T10:25:23Z)
- Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operations.
The proposed method outperforms state-of-the-art methods on the LLaMA and Qwen model families, improving WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z)
- S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context. It remains unclear whether sparse attention can maintain the model's quality at the scale of today's large language models. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z)
- HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups compared to the recent Ring Attention and DeepSpeed-Ulysses, respectively.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR [10.059491353103526]
We propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators.
IntelliGen considers both computation and data movement optimizations.
We evaluate IntelliGen on NVIDIA GPUs, AMD GPUs, and Cambricon MLUs, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively.
arXiv Detail & Related papers (2023-07-11T03:17:40Z)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM by tiling the attention computation (a minimal sketch appears after this list).
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
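As a companion to the FlashAttention entry above (the baseline against which SageAttention2++ reports its 3.9x speedup), here is a minimal pure-PyTorch sketch of the tiled, online-softmax computation that an IO-aware kernel performs tile by tile in on-chip SRAM. The block size, function name, and tolerance check are illustrative assumptions, and no actual HBM/SRAM tiling happens in Python.

```python
# Illustrative sketch of tiled attention with an online softmax (assumed names/block size);
# a real IO-aware kernel streams K/V tiles through on-chip SRAM to cut HBM traffic.
import torch

def tiled_attention(q, k, v, block=64):
    seq, d = q.shape
    out = torch.zeros_like(v)
    row_max = torch.full((seq, 1), float("-inf"))
    row_sum = torch.zeros(seq, 1)
    for start in range(0, seq, block):            # the full seq x seq score matrix
        kb = k[start:start + block]               # is never materialized
        vb = v[start:start + block]
        s = (q @ kb.T) / d ** 0.5                 # partial scores for this tile
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)                # unnormalized softmax for the tile
        rescale = torch.exp(row_max - new_max)    # correct previously accumulated stats
        row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # expect: True
```

The rescaling by exp(old_max - new_max) is what lets the loop visit key/value tiles one at a time and still recover the exact softmax; quantized variants such as SageAttention build on the same tiled structure.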
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.