SystolicAttention: Fusing FlashAttention within a Single Systolic Array
- URL: http://arxiv.org/abs/2507.11331v2
- Date: Wed, 16 Jul 2025 13:36:16 GMT
- Title: SystolicAttention: Fusing FlashAttention within a Single Systolic Array
- Authors: Jiawei Lin, Guokai Chen, Yuanlong Li, Thomas Bourgeat
- Abstract summary: Transformer models rely heavily on scaled dot-product attention (SDPA). Current systolic-array-based accelerators face significant challenges when executing FlashAttention. We propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run within a single systolic array.
- Score: 2.8650887057567864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays can only achieve high utilization for consecutive and large matrix multiplications. In contrast, FlashAttention requires frequently interleaved matrix multiplications and softmax operations. The frequent data swaps between the systolic array and external vector units result in low systolic array utilization. This is further exacerbated by the fact that softmax involves numerous non-matrix operations, which are not well-suited for systolic arrays. Moreover, the concurrent execution of matrix multiplication on systolic arrays and softmax on vector units leads to register file and SRAM port contention, further degrading performance. To overcome these limitations, we propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run within a single systolic array, eliminating the need for external vector units. At the core of FSA is SystolicAttention, a novel scheduling algorithm that maps FlashAttention operations onto systolic arrays with fine-grained, element-wise overlap. This significantly improves array utilization while preserving the original floating-point operation order to maintain numerical stability. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art commercial accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS NeuronCore-v2 and Google TPUv5e, respectively, with only about 10% area overhead.
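For intuition about the interleaving the abstract describes, below is a minimal single-head NumPy sketch of a FlashAttention-style loop (the function name, tile size, and variable names are illustrative, not taken from the paper). Each K/V tile requires a score matmul, element-wise softmax bookkeeping (max, exp, rescale), and an output matmul in strict alternation, which is precisely the fine-grained overlap that SystolicAttention must schedule onto a single array.

```python
import numpy as np

def flash_attention_sketch(Q, K, V, tile=128):
    """Single-head FlashAttention-style loop with online softmax (sketch only)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))      # running, unnormalized output accumulator
    m = np.full(n, -np.inf)   # running row-wise score maxima
    l = np.zeros(n)           # running softmax denominators

    for s in range(0, K.shape[0], tile):
        K_t, V_t = K[s:s + tile], V[s:s + tile]
        S = (Q @ K_t.T) * scale               # matmul: scores for this tile
        m_new = np.maximum(m, S.max(axis=1))  # non-matrix op: row maxima
        alpha = np.exp(m - m_new)             # rescale factor for older tiles
        P = np.exp(S - m_new[:, None])        # non-matrix op: exponentials
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ V_t      # matmul: accumulate weighted V
        m = m_new

    return O / l[:, None]                     # final softmax division
```

Note how matrix and non-matrix operations alternate inside every tile iteration; on a conventional accelerator, each alternation forces a data swap between the systolic array and the vector unit, which is the utilization loss FSA targets.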
Related papers
- Orthogonal Finetuning Made Scalable [87.49040247077389]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
arXiv Detail & Related papers (2025-06-24T17:59:49Z)
- FLASH-D: FlashAttention with Hidden Softmax Division [3.668018928502405]
Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic. This work presents FLASH-D, a mathematically equivalent yet simplified formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel.
arXiv Detail & Related papers (2025-05-20T11:01:33Z)
- Periodic Online Testing for Sparse Systolic Tensor Arrays [0.0]
Modern Machine Learning (ML) applications often benefit from structured sparsity, a technique that efficiently reduces model complexity and simplifies handling of sparse data in hardware. This paper introduces an online error-checking technique capable of detecting and locating permanent faults within sparse systolic tensor arrays before computation begins.
arXiv Detail & Related papers (2025-04-25T18:10:45Z)
- Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels [14.756974816917584]
Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. We present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels.
arXiv Detail & Related papers (2025-03-18T16:09:47Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We then replace softmax with the sigmoid function and utilize balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
- FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores [6.404201720333765]
We propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. We show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
arXiv Detail & Related papers (2024-12-15T01:12:33Z)
- EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
EoRA is a fine-tuning-free method that augments compressed Large Language Models with low-rank matrices. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU [6.436294460697506]
We propose RASA, a Register-Aware Systolic Array matrix engine for CPUs.
We develop techniques to divide an execution stage into several sub-stages and to overlap instructions, hiding overheads by running them concurrently.
RASA-based designs improve performance significantly with negligible area and power overhead.
arXiv Detail & Related papers (2021-10-05T00:01:31Z)
- FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training [1.718730454558804]
We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient.
To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture.
We also present a compilation scheme for tiling matrix-multiplication-and-accumulation operations in a training workload to best utilize the resources of FlexSA; a generic tiling sketch appears after this list.
arXiv Detail & Related papers (2020-04-27T15:51:20Z)
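The FlexSA entry above mentions tiling matrix-multiplication-and-accumulation operations so that each block of work matches the systolic array's dimensions. Below is a generic tiled matmul-and-accumulate sketch (not FlexSA's actual compiler pass; the tile size and names are illustrative) showing the block structure such a compilation scheme produces.

```python
import numpy as np

def tiled_matmul_accumulate(A, B, C, tile=32):
    """Compute C += A @ B one (tile x tile) block at a time (sketch only).

    Each inner block corresponds to one pass of operands through a
    tile-by-tile systolic array; a tiling pass like FlexSA's chooses
    block shapes that keep the array's processing elements busy.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # one array-sized unit of work: multiply and accumulate
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```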
This list is automatically generated from the titles and abstracts of the papers on this site.