Related papers: A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE

URL: http://arxiv.org/abs/2401.02721v2
Date: Tue, 25 Jun 2024 13:49:31 GMT
Title: A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE
Authors: Ikumi Okubo, Keisuke Sugiura, Hiroki Matsutani
Abstract summary: We propose a lightweight hybrid model which uses Neural ODE as a backbone instead of ResNet for 12.1$\times$ parameter reduction. For the STL10 dataset, the proposed model achieves 80.15% top-1 accuracy which is comparable to ResNet50. The proposed FPGA implementation achieves a 34.01$\times$ speedup for the backbone and MHSA parts, and it achieves an overall 9.85$\times$ speedup when taking into account software pre- and post-processing.
Score: 0.8403582577557918
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer has been adopted to a wide range of tasks and shown to outperform CNNs and RNNs while it suffers from high training cost and computational complexity. To address these issues, a hybrid approach has become a recent research trend, which replaces a part of ResNet with an MHSA (Multi-Head Self-Attention). In this paper, we propose a lightweight hybrid model which uses Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet for 12.1$\times$ parameter reduction. For the STL10 dataset, the proposed model achieves 80.15% top-1 accuracy which is comparable to ResNet50. Then, the proposed model is deployed on a modest-sized FPGA device for edge computing. To further reduce FPGA resource utilization, the model is quantized following QAT (Quantization Aware Training) scheme instead of PTQ (Post Training Quantization) to suppress the accuracy loss. As a result, an extremely lightweight Transformer-based model can be implemented on resource-limited FPGAs. The weights of the feature extraction network are stored on-chip to minimize the memory transfer overhead, allowing faster inference. By eliminating the overhead of memory transfers, inference can be executed seamlessly, leading to accelerated inference. The proposed FPGA implementation achieves a 34.01$\times$ speedup for the backbone and MHSA parts, and it achieves an overall 9.85$\times$ speedup when taking into account software pre- and post-processing. It also achieves an overall 7.10$\times$ higher energy efficiency compared to the ARM Cortex-A53 CPU.

Related papers

Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
Stochastic Spiking Attention: Accelerating Attention with Stochastic Computing in Spiking Networks [33.51445486269896]
Spiking Neural Networks (SNNs) have been recently integrated into Transformer architectures due to their potential to reduce computational demands and to improve power efficiency. We propose a novel framework leveraging stochastic computing (SC) to effectively execute the dot-product attention for SNN-based Transformers.
arXiv Detail & Related papers (2024-02-14T11:47:19Z)
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [63.98380888730723]
We introduce the Convolutional Transformer layer (ConvFormer) and the ConvFormer-based Super-Resolution network (CFSR). CFSR efficiently models long-range dependencies and extensive receptive fields with a slight computational cost. It achieves 0.39 dB gains on Urban100 dataset for x2 SR task while containing 26% and 31% fewer parameters and FLOPs, respectively.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module. We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH). In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
Accurate, Low-latency, Efficient SAR Automatic Target Recognition on FPGA [3.251765107970636]
Synthetic aperture radar (SAR) automatic target recognition (ATR) is the key technique for remote-sensing image recognition. The state-of-the-art convolutional neural networks (CNNs) for SAR ATR suffer from high computation cost and large memory footprint. We propose a comprehensive GNN-based model-architecture co-design on FPGA to address the above issues.
arXiv Detail & Related papers (2023-01-04T05:35:30Z)
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms. This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
Fast convolutional neural networks on FPGAs with hls4ml [0.22756183402372013]
We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks on FPGAs. We demonstrate how to achieve inference latency of $5\,\mu$s using convolutional architectures, while preserving state-of-the-art model performance.
arXiv Detail & Related papers (2021-01-13T14:47:11Z)
Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the decline is due to activation quantization. Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.