FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition
on The Edge
- URL: http://arxiv.org/abs/2311.03390v1
- Date: Sat, 4 Nov 2023 10:38:21 GMT
- Title: FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition
on The Edge
- Authors: Azzam Alhussain and Mingjie Lin
- Abstract summary: This paper proposed an integrated end-to-end HAR scalable HW/SW accelerator co-design based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our development uses partially streaming dataflow architecture to achieve higher throughput versus network design and resource utilization trade-off.
Our proposed methodology achieved nearly 81% prediction accuracy with an approximately 24 FPS real-time inference throughput at 187MHz on ZCU104.
- Score: 0.6254873489691849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating Human Action Recognition (HAR) efficiently for real-time
surveillance and robotic systems on edge chips remains a challenging research
field, given its high computational and memory requirements. This paper
proposed an integrated end-to-end HAR scalable HW/SW accelerator co-design
based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN
architecture. Our network accelerator was trained on UCF101 and UCF24 datasets
and implemented on edge SoC-FPGA. Our development uses partially streaming
dataflow architecture to achieve higher throughput versus network design and
resource utilization trade-off. We also fused all convolutional, batch-norm,
and ReLU operations into a single homogeneous layer and utilized the
Lucas-Kanade motion flow method to enable a high parallelism accelerator design
and optimized on-chip engine computing.Furthermore, our proposed methodology
achieved nearly 81% prediction accuracy with an approximately 24 FPS real-time
inference throughput at 187MHz on ZCU104, which is 1.7x - 1.9x higher than the
prior research. Lastly, the designed framework was benchmarked against several
hardware chips for higher throughput and performance measurements and is now
available as an open-source project on GitHub for training and implementation
on edge platforms.
Related papers
- LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference [25.342107763021147]
This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications.
By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators.
arXiv Detail & Related papers (2024-11-01T02:54:11Z) - Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNN) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z) - Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs)
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z) - REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption [4.713756093611972]
We present the first-of-its-kind multi-chiplet-based FHE accelerator REED' for overcoming the limitations of prior monolithic designs.
Results demonstrate that REED 2.5D microprocessor consumes 96.7 mm$2$ chip area, 49.4 W average power in 7nm technology.
arXiv Detail & Related papers (2023-08-05T14:04:39Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks.
specially trained CNNs that employ parametrised early exits along their depth to save during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - A fully pipelined FPGA accelerator for scale invariant feature transform
keypoint descriptor matching, [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.