Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design
- URL: http://arxiv.org/abs/2209.09570v1
- Date: Tue, 20 Sep 2022 09:28:26 GMT
- Title: Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design
- Authors: Hongxiang Fan, Thomas Chau, Stylianos I. Venieris, Royson Lee,
Alexandros Kouris, Wayne Luk, Nicholas D. Lane, Mohamed S. Abdelfattah
- Abstract summary: Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
- Score: 66.39546326221176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based neural networks have become pervasive in many AI tasks.
Despite their excellent algorithmic performance, the use of the attention
mechanism and feed-forward network (FFN) demands excessive computational and
memory resources, which often compromises their hardware performance. Although
various sparse variants have been introduced, most approaches only focus on
mitigating the quadratic scaling of attention on the algorithm level, without
explicitly considering the efficiency of mapping their methods on real hardware
designs. Furthermore, most efforts only focus on either the attention mechanism
or the FFNs but without jointly optimizing both parts, causing most of the
current designs to lack scalability when dealing with different input lengths.
This paper systematically considers the sparsity patterns in different variants
from a hardware perspective. On the algorithmic level, we propose FABNet, a
hardware-friendly variant that adopts a unified butterfly sparsity pattern to
approximate both the attention mechanism and the FFNs. On the hardware level, a
novel adaptable butterfly accelerator is proposed that can be configured at
runtime via dedicated hardware control to accelerate different butterfly layers
using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet
achieves the same accuracy as the vanilla Transformer while reducing the amount
of computation by 10 to 66 times and the number of parameters by 2 to 22 times. By
jointly optimizing the algorithm and hardware, our FPGA-based butterfly
accelerator achieves 14.2 to 23.2 times speedup over state-of-the-art
accelerators normalized to the same computational budget. Compared with
optimized CPU and GPU designs on Raspberry Pi 4 and Jetson Nano, our system is
up to 273.8 and 15.1 times faster under the same power budget.
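The butterfly sparsity pattern at the core of FABNet replaces a dense n-by-n weight matrix with a product of log2(n) sparse factors, each holding only two nonzeros per row, so parameters and compute drop from O(n^2) to O(n log n). A minimal NumPy sketch of that structure (random weights; FABNet's actual factor layout and hardware mapping are not reproduced here):

```python
import numpy as np

def random_butterfly_factors(n, rng):
    """Build log2(n) sparse butterfly factors of size n x n.

    Each factor mixes index pairs (i, i + stride/2) with a random
    2x2 block, so every factor has exactly 2n nonzeros (hypothetical
    weights, for illustration only).
    """
    assert n & (n - 1) == 0, "n must be a power of two"
    factors = []
    stride = n
    while stride >= 2:
        f = np.zeros((n, n))
        half = stride // 2
        for start in range(0, n, stride):
            for i in range(half):
                a, b = start + i, start + i + half
                w = rng.standard_normal((2, 2))
                f[a, a], f[a, b] = w[0]
                f[b, a], f[b, b] = w[1]
        factors.append(f)
        stride //= 2
    return factors

def butterfly_matvec(factors, x):
    # Applying the factors in sequence costs O(n log n) multiply-adds
    # in total, versus O(n^2) for a dense matrix-vector product.
    for f in factors:
        x = f @ x
    return x
```

A dense 1024-wide layer would need about 1M weights; the butterfly form needs only 2 * 1024 * 10, roughly a 50x reduction, consistent with the parameter savings the abstract reports.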
Related papers
- SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
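Sliding-window attention as summarized above restricts each token to a fixed local neighborhood, so the score matrix has O(L * w) useful entries instead of O(L^2). A small NumPy sketch of the masking idea (illustrative only; SWAT's dataflow-aware FPGA design is not shown, and a real kernel would avoid materializing the full matrix):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Token i may attend to token j only when |i - j| <= window.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def windowed_attention(q, k, v, window):
    # Dense computation with a band mask, for clarity.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(sliding_window_mask(len(q), window), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```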
arXiv Detail & Related papers (2024-05-27T10:25:08Z)
- All-to-all reconfigurability with sparse and higher-order Ising machines [0.0]
We evaluate p-computers based on Ising Machines (IMs) using benchmark implementations of an open optimization problem.
The 3R3X problem has a glassy energy landscape, and it has recently been used to benchmark various IMs and other solvers.
We implement this architecture in FPGAs and show that p-bit networks running an adaptive version of the powerful parallel tempering algorithm demonstrate competitive algorithmic and prefactor advantages.
arXiv Detail & Related papers (2023-11-21T20:27:02Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
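N:M sparsity keeps at most N nonzero weights in every contiguous group of M, a pattern that structured-sparse hardware can exploit directly (2:4 is the common case). A magnitude-based pruning sketch (illustrative only; the paper's IDP training scheme is not shown):

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Zero out the (m - n) smallest-magnitude weights in every
    group of m consecutive weights (simple magnitude criterion)."""
    flat = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries per group.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)
```

Because every group of 4 holds exactly 2 nonzeros, the hardware can store the values plus 2-bit indices and skip the zeros deterministically, which is what makes N:M patterns cheaper to accelerate than unstructured sparsity.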
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design has very small accuracy loss and achieves 80.2$\times$ and 2.6$\times$ speedups compared to CPU and GPU implementations.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
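Mixed-precision quantization assigns a different bit-width to each layer, and the proxy metric above guides that per-layer allocation. A toy sketch of symmetric uniform quantization under a hypothetical bit assignment (not OMPQ's actual search procedure):

```python
import numpy as np

def uniform_quantize(w, bits):
    # Symmetric uniform quantization: map weights onto the integer
    # grid [-qmax, qmax], then scale back to the original range.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
# Hypothetical two-layer model and per-layer bit-widths.
layers = {"fc1": rng.standard_normal((16, 16)),
          "fc2": rng.standard_normal((16, 16))}
bit_alloc = {"fc1": 8, "fc2": 4}
quantized = {name: uniform_quantize(w, bit_alloc[name])
             for name, w in layers.items()}
```

The rounding error per weight is bounded by half the scale step, so sensitive layers get more bits (smaller steps) while robust layers tolerate aggressive low-bit settings; choosing that split well is exactly the search problem OMPQ accelerates.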
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search (NAS)-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with 0.4% drop in the top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- SPEC2: SPECtral SParsE CNN Accelerator on FPGAs [31.31419913907224]
We propose SPEC2 -- the first work to prune and accelerate spectral CNNs.
We design an optimized pipeline architecture on FPGA that has efficient random access into sparse kernels.
The resulting accelerators achieve up to 24x higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.
arXiv Detail & Related papers (2019-10-16T23:30:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.