Related papers: SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

URL: http://arxiv.org/abs/2012.09852v3
Date: Thu, 18 Jul 2024 18:48:38 GMT
Title: SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Authors: Hanrui Wang, Zhekai Zhang, Song Han,
Abstract summary: We present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, 347x speedup, and 1,4x, 3.2x, 1193x, 4059x energy savings.
Score: 10.981433334942476
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance than convolutional and recurrent architectures. However, attention becomes the compution bottleneck because of its quadratic computational complexity to input length, complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, 347x speedup, and 1,4x, 3.2x, 1193x, 4059x energy savings over A3 accelerator, MNNFast accelerator, TITAN Xp GPU, Xeon CPU, respectively.

Related papers

ProxyAttn: Guided Sparse Attention via Representative Heads [59.03412871683236]
We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation.<n>We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
arXiv Detail & Related papers (2025-09-29T13:10:39Z)
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention [88.47701139980636]
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck.<n>We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank.<n>We propose SLA, a trainable attention method that fuses sparse and linear attention to accelerate diffusion models.
arXiv Detail & Related papers (2025-09-28T17:58:59Z)
Accelerating 3D Gaussian Splatting with Neural Sorting and Axis-Oriented Rasterization [14.87046071090259]
3D Gaussian Splatting (3DGS) has recently gained significant attention for high-quality and efficient view synthesis.<n>Despite its impressive algorithmic performance, real-time rendering on resource-constrained devices remains a major challenge due to tight power and area budgets.
arXiv Detail & Related papers (2025-06-08T10:14:54Z)
Spark Transformer: Reactivating Sparsity in FFN and Attention [63.20677098823873]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism.<n>This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z)
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [57.56385490252605]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention.<n>We propose SVG2, a training-free framework that maximizes identification accuracy and computation minimizes waste.
arXiv Detail & Related papers (2025-05-24T21:30:29Z)
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has a quadratic complexity, leading to high inference costs and latency for long sequences.<n>We propose a simple, novel, and effective procedure for correcting this distributional shift.<n>Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z)
A Reconfigurable Stream-Based FPGA Accelerator for Bayesian Confidence Propagation Neural Networks [0.0]
Brain-inspired algorithms are attractive and emerging alternatives to classical deep learning methods. BCPNN is an important tool for both machine learning and computational neuroscience research. BCPNN can reach state-of-the-art performance in tasks such as learning and memory recall compared to other models. We design a custom stream-based accelerator for BCPNN using Field-Programmable Gate Arrays (FPGA) using Xilinx Vitis High-Level Synthesis (HLS) flow.
arXiv Detail & Related papers (2025-03-03T14:06:43Z)
Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces layers with Kolmogorov-Arnold Network (KAN) layers. We identify three key challenges: (C1) Base function, (C2) Inefficiency, and (C3) Weight. With these designs, KAT outperforms traditional-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z)
Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference [1.0919012968294923]
We introduce a novel algorithm-architecture co-design approach that accelerates transformers using head sparsity, block sparsity and approximation opportunities to reduce computations in attention and reduce memory access. With the observation of the huge redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning to prune unimportant blocks in the attention matrix at run time. Also propose integer-based head pruning to detect and prune unimportant heads at an early stage at run time.
arXiv Detail & Related papers (2024-07-17T11:15:16Z)
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators. We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
SNP: Structured Neuron-level Pruning to Preserve Attention Scores [2.4204190488008046]
Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs) We propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP) Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors.
arXiv Detail & Related papers (2024-04-18T03:21:28Z)
Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference on high-performance Cores. At SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art library, i.e., Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks. The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources. This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks. Current networks often occupy large number of parameters and require heavy computation costs. Our proposed FastFlowNet works in the well-known coarse-to-fine manner with following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient. Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.