FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition
on The Edge
- URL: http://arxiv.org/abs/2311.03390v1
- Date: Sat, 4 Nov 2023 10:38:21 GMT
- Title: FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition
on The Edge
- Authors: Azzam Alhussain and Mingjie Lin
- Abstract summary: This paper proposes an integrated, end-to-end, scalable HW/SW HAR accelerator co-design based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our development uses a partially streaming dataflow architecture to trade off network design and resource utilization for higher throughput.
Our proposed methodology achieved nearly 81% prediction accuracy at approximately 24 FPS real-time inference throughput at 187 MHz on a ZCU104.
- Score: 0.6254873489691849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accelerating Human Action Recognition (HAR) efficiently for real-time
surveillance and robotic systems on edge chips remains a challenging research
field, given its high computational and memory requirements. This paper
proposes an integrated, end-to-end, scalable HW/SW HAR accelerator co-design
based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN
architecture. Our network accelerator was trained on the UCF101 and UCF24
datasets and implemented on an edge SoC-FPGA. Our development uses a partially
streaming dataflow architecture to trade off network design and resource
utilization for higher throughput. We also fused all convolutional, batch-norm,
and ReLU operations into a single homogeneous layer and utilized the
Lucas-Kanade motion flow method to enable a highly parallel accelerator design
and optimized on-chip engine computing. Furthermore, our proposed methodology
achieved nearly 81% prediction accuracy at approximately 24 FPS real-time
inference throughput at 187 MHz on a ZCU104, which is 1.7x-1.9x higher than
prior research. Lastly, the designed framework was benchmarked against several
hardware chips for throughput and performance measurements and is now
available as an open-source project on GitHub for training and implementation
on edge platforms.
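The conv/batch-norm/ReLU fusion mentioned in the abstract is a standard inference-time optimization; the paper's own implementation is part of its open-source GitHub release. As a rough illustration only, the sketch below shows the general batch-norm folding idea followed by symmetric 8-bit weight quantization in PyTorch; the helper name fuse_conv_bn and the toy layer sizes are assumptions for this example, not details taken from the paper.

```python
# Minimal sketch of conv + batch-norm fusion and 8-bit weight quantization.
# NOT the paper's code; layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution (eval mode)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    # Per-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None \
        else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.train()
_ = bn(conv(torch.randn(16, 3, 32, 32)))     # populate BN running statistics
conv.eval(); bn.eval()

x = torch.randn(1, 3, 32, 32)
ref = torch.relu(bn(conv(x)))                  # original conv -> BN -> ReLU
out = torch.relu(fuse_conv_bn(conv, bn)(x))    # single fused layer -> ReLU
print(torch.allclose(ref, out, atol=1e-4))     # expected: True

# Symmetric per-tensor int8 quantization of the fused weights (illustrative).
w = fuse_conv_bn(conv, bn).weight.data
w_scale = w.abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w / w_scale), -127, 127).to(torch.int8)
```

Folding the normalization into the convolution weights leaves a single homogeneous multiply-accumulate-plus-ReLU operation per layer, which is what makes each layer simple to map onto one on-chip compute engine in a streaming dataflow design.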
Related papers
- Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs [0.815557531820863]
Event cameras are highly relevant for integration into embedded real-time systems.
One effective approach to ensure the necessary throughput and latency for event processing systems is the use of graph convolutional networks (GCNs).
We introduce a series of hardware-aware optimisations tailored for PointNet++, a GCN architecture designed for point cloud processing.
arXiv Detail & Related papers (2024-06-11T14:47:36Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption [4.713756093611972]
We present the first-of-its-kind multi-chiplet-based FHE accelerator, REED, for overcoming the limitations of prior monolithic designs.
Results demonstrate that the REED 2.5D microprocessor occupies 96.7 mm$^2$ of chip area and consumes 49.4 W average power in 7 nm technology.
arXiv Detail & Related papers (2023-08-05T14:04:39Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have large numbers of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)