SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs
- URL: http://arxiv.org/abs/2405.17025v1
- Date: Mon, 27 May 2024 10:25:08 GMT
- Title: SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs
- Authors: Zhenyu Bai, Pranav Dangi, Huize Li, Tulika Mitra,
- Abstract summary: Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
- Score: 3.302913401404089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attention is highly structured, it does not align perfectly with the microarchitecture of the conventional accelerators, leading to suboptimal implementation. In response, we propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input. The proposed microarchitecture is based on a design that maximizes data reuse by using a combination of row-wise dataflow, kernel fusion optimization, and an input-stationary design considering the distributed memory and computation resources of FPGA. Consequently, it achieves up to 22$\times$ and 5.7$\times$ improvement in latency and energy efficiency compared to the baseline FPGA-based accelerator and 15$\times$ energy efficiency compared to GPU-based solution.
Related papers
- Implementation of high-efficiency, lightweight residual spiking neural network processor based on field-programmable gate arrays [0.49806798459446283]
This work presents an efficient residual SNN accelerator that combines algorithm and hardware co-design to optimize inference energy efficiency.<n>The proposed processor achieves a classification accuracy of 87.11% on the CIFAR-10 dataset, with an inference time of 3.98 ms per image and an energy efficiency of 183.5 FPS/W.
arXiv Detail & Related papers (2025-12-09T02:08:46Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Exact Nearest-Neighbor Search on Energy-Efficient FPGA Devices [10.725513609195]
The paper proposes two different energy-efficient solutions adopting the same FPGA low-level configuration.<n>The first solution maximizes system throughput by processing the queries of a batch in parallel over a streamed dataset.<n>The second minimizes latency by processing each kNN incoming query in parallel over an in-memory dataset.
arXiv Detail & Related papers (2025-10-19T07:29:16Z) - Joint Channel Estimation and Computation Offloading in Fluid Antenna-assisted MEC Networks [81.36647816787713]
We propose an FA-assisted offloading framework to minimize the delay of channel estimation.<n>We show that the proposed system significantly reduces the accuracy under efficient communication.
arXiv Detail & Related papers (2025-09-16T08:48:44Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs [5.762543012823378]
InTAR is a novel accelerator design methodology for HDV applications on FPGAs.
It switches execution patterns automatically with a static schedule determined before circuit design.
InTAR achieves a high clock frequency with fewer resources and low reconfiguration time.
arXiv Detail & Related papers (2025-02-12T21:43:51Z) - Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget.
The limited eFPGA logic and memory significantly constrain compute capabilities and model size.
The proposed eFPGA accelerator focuses on minimizing resource usage and allowing flexibility for on-field recalibration over throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy
Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA
Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design has very small accuracy loss and has 80.2 $times$ and 2.6 $times$ speedup compared to CPU and GPU implementation.
arXiv Detail & Related papers (2022-08-07T05:48:38Z) - Easy and Efficient Transformer : Scalable Inference Solution For large
NLP mode [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z) - NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function
Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm- hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z) - FTRANS: Energy-Efficient Acceleration of Transformers using FPGA [11.032972017827248]
We propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations.
Our framework significantly reduces the model size of NLP models by up to 16 times.
Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
arXiv Detail & Related papers (2020-07-16T18:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.