A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
- URL: http://arxiv.org/abs/2411.18148v1
- Date: Wed, 27 Nov 2024 08:53:19 GMT
- Title: A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
- Authors: Ehsan Kabir, Austin R. J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang
- Abstract summary: ADAPTOR is a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs.
It incorporates efficient matrix tiling to distribute resources across FPGA platforms.
It achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
- Abstract: Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing a custom accelerator for each model is complex and time-intensive. Some existing custom accelerators offer no runtime adaptability, and they often rely on sparse matrices to reduce latency, which complicates the hardware design because application-specific sparsity patterns must be supported. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR increases the utilization of processing elements and on-chip memory, improving parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and on embedded platforms such as the VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
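The abstract does not spell out ADAPTOR's tiling scheme; as a rough illustration of what tiling a dense transformer matrix multiplication looks like, here is a minimal NumPy sketch (tile sizes and layer dimensions below are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def tiled_matmul(A, B, TM=64, TN=64, TK=64):
    """Dense C = A @ B computed block by block, mimicking how an FPGA
    accelerator streams fixed-size tiles through a PE array and keeps
    partial sums in on-chip memory. Tile sizes are illustrative."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, TM):              # output tile rows
        for j in range(0, N, TN):          # output tile columns
            acc = np.zeros((min(TM, M - i), min(TN, N - j)))
            for k in range(0, K, TK):      # stream the reduction dimension
                acc += A[i:i + TM, k:k + TK] @ B[k:k + TK, j:j + TN]
            C[i:i + TM, j:j + TN] = acc
    return C

# Example: one feed-forward projection of an encoder layer
# (seq_len=128, d_model=256, d_ff=1024 are illustrative sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 256))
W = rng.standard_normal((256, 1024))
assert np.allclose(tiled_matmul(x, W), x @ W)
```

On actual hardware, each (TM, TN) output tile would map onto the processing-element array while the k-loop streams operands from on-chip buffers; the software loop nest above only mimics that blocking.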
Related papers
- FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV).
This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs).
It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
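For reference, the dense multi-head attention block that FAMOUS accelerates is the chain of matrix products sketched below in plain NumPy (dimensions are illustrative assumptions; this is a software reference, not the accelerator's dataflow):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Dense multi-head attention: the matrix products an accelerator
    like FAMOUS maps onto its processing-element array."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # linear projections
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)                  # (heads, seq, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)            # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    context = weights @ V                                    # (heads, seq, d_head)
    context = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return context @ Wo                                      # output projection

# Illustrative sizes only.
rng = np.random.default_rng(0)
d_model, heads, seq = 256, 8, 64
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, heads)          # shape (64, 256)
```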
arXiv Detail & Related papers (2024-09-21T05:25:46Z)
- ProTEA: Programmable Transformer Encoder Acceleration on FPGA [0.0]
Transformer neural networks (TNNs) have been widely utilized in a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV).
Despite the popularity of TNNs, there have been only a limited number of hardware accelerators targeting their two most compute-intensive blocks, the multi-head attention and the feed-forward network.
This paper introduces ProTEA, a programmable runtime accelerator tailored for the dense computations of state-of-the-art transformer encoders.
arXiv Detail & Related papers (2024-09-21T01:44:13Z)
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding-window static sparse attention mitigates the quadratic cost of full attention on long inputs by limiting the attention scope of each token.
We propose SWAT, a dataflow-aware FPGA-based accelerator design that efficiently leverages this sparsity to achieve scalable performance on long inputs.
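As a minimal illustration of the sliding-window sparsity a SWAT-style design exploits (the window size below is an arbitrary choice, not the paper's), the static attention mask looks like this:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Static band mask: token i may attend only to tokens within `window`
    positions of i, which is the sparsity pattern exploited here."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

print(sliding_window_mask(seq_len=8, window=2).astype(int))
# Each row has at most 2*window + 1 nonzeros, so attention cost grows
# linearly with sequence length instead of quadratically.
```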
arXiv Detail & Related papers (2024-05-27T10:25:08Z)
- X-Former: In-Memory Acceleration of Transformers [7.194491150684456]
Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism.
Traditional deep neural network (DNN) accelerators face limitations in processing Transformers efficiently.
In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge.
We present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements.
arXiv Detail & Related papers (2023-03-13T21:11:54Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
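As background for the butterfly sparsity mentioned above: a butterfly matrix replaces a dense $n \times n$ weight matrix with a product of $\log_2 n$ sparse factors, each with only two nonzeros per row. A small NumPy sketch of that structure (random factor values, purely illustrative; not the paper's trained matrices):

```python
import numpy as np

def random_butterfly_factors(n, rng):
    """Build the log2(n) sparse factors of an n x n butterfly matrix.
    Each factor mixes pairs of entries that are `stride` apart, so it has
    exactly two nonzeros per row (O(n log n) parameters overall)."""
    factors, stride = [], n // 2
    while stride >= 1:
        F = np.zeros((n, n))
        for i in range(n):
            j = i ^ stride                      # partner index for this stride
            F[i, i], F[i, j] = rng.standard_normal(2)
        factors.append(F)
        stride //= 2
    return factors

rng = np.random.default_rng(0)
n = 8                                           # must be a power of two
factors = random_butterfly_factors(n, rng)
W = np.linalg.multi_dot(factors)                # dense equivalent, for checking only
x = rng.standard_normal(n)
y = x.copy()
for F in reversed(factors):                     # apply the sparse factors right to left
    y = F @ y
assert np.allclose(y, W @ x)
```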
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design incurs very small accuracy loss and achieves 80.2$\times$ and 2.6$\times$ speedups over CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
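The summary does not give VAQF's precision-selection procedure; as a hedged illustration of the knob such a framework tunes, here is plain symmetric uniform quantization of activations at a few bit-widths (a NumPy sketch, not VAQF's algorithm):

```python
import numpy as np

def quantize_activations(x, bits):
    """Symmetric uniform quantization to `bits` bits. Lower bit-widths cut
    on-chip memory and DSP cost at the price of larger rounding error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale                              # recover values with q * scale

x = np.random.default_rng(0).standard_normal(1000)
for bits in (8, 6, 4):
    q, scale = quantize_activations(x, bits)
    print(f"{bits}-bit mean abs error: {np.abs(q * scale - x).mean():.4f}")
```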
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
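The key fact being used is that a Toeplitz matrix-vector product can be computed in $O(n \log n)$ by embedding the matrix in a circulant and applying the FFT. A small NumPy sketch of that standard trick (dimensions illustrative; this is not the paper's full kernelized-attention algorithm):

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the n x n Toeplitz matrix with first column c and first row r
    by x in O(n log n): embed it in a length-2n circulant, then use the FFT."""
    n = len(x)
    circ = np.concatenate([c, [0.0], r[:0:-1]])          # circulant's first column
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(x, len(circ)))
    return y[:n].real

rng = np.random.default_rng(0)
n = 6
c = rng.standard_normal(n)                                # first column (c[0] on the diagonal)
r = np.concatenate([[c[0]], rng.standard_normal(n - 1)])  # first row
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.standard_normal(n)
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))   # FFT path matches the dense product
```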
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization of multi-task NLP inference.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.