LPYOLO: Low Precision YOLO for Face Detection on FPGA
- URL: http://arxiv.org/abs/2207.10482v1
- Date: Thu, 21 Jul 2022 13:54:52 GMT
- Title: LPYOLO: Low Precision YOLO for Face Detection on FPGA
- Authors: Bestami Günay, Sefa Burak Okcu and Hasan Şakir Bilge
- Abstract summary: Face detection in surveillance systems is one of the most demanded applications on the security market.
The TinyYolov3 architecture is redesigned and deployed for face detection.
The model is converted to an HLS-based application using the FINN framework and the FINN-HLS library.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the number of edge computing devices and of artificial intelligence applications running on them has grown rapidly. In edge computing, decision-making processes and computations are moved from servers to edge devices, so cheap and low-power devices are required. FPGAs consume very little power, are well suited to parallel operations, and are therefore a natural fit for running Convolutional Neural Networks (CNNs), the fundamental building block of artificial intelligence applications. Face detection in surveillance systems is one of the most demanded applications on the security market. In this work, the TinyYolov3 architecture, a CNN-based object detection method developed for embedded systems, is redesigned and deployed for face detection. The PYNQ-Z2, which carries a low-end Xilinx Zynq 7020 System-on-Chip (SoC), is selected as the target board. The redesigned TinyYolov3 model is defined at several bit-width precisions with the Brevitas library, which provides the fundamental CNN layers and activations in integer-quantized form, and the model is then trained in this quantized structure on the WiderFace dataset. To decrease latency and power consumption, the on-chip memory of the FPGA is configured to store all of the network parameters, and the last activation function is changed to a rescaled HardTanh instead of a Sigmoid. In addition, a high degree of parallelism is applied to the logical resources of the FPGA. The model is converted to an HLS-based application using the FINN framework and the FINN-HLS library, which provides the layer definitions in C++, and is then synthesized and deployed. The CPU of the SoC runs a multithreaded pipeline responsible for preprocessing, postprocessing and TCP/IP streaming. With the 4-bit precision model, the board achieves 2.4 W total power consumption, 18 Frames-Per-Second (FPS) throughput and a 0.757 mAP accuracy rate on the Easy category of WiderFace. (Minimal illustrative sketches of the quantized layer definition, the rescaled output activation and the CPU-side pipeline are given below.)
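The quantization step described in the abstract is easy to picture with a small example. The following is a minimal sketch, not the authors' released code, of how one TinyYolov3-style convolutional block could be declared at a configurable bit width using standard Brevitas quantized layers (QuantConv2d, QuantIdentity, QuantReLU); the channel sizes, kernel size and block structure are assumptions for illustration.

```python
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantIdentity, QuantReLU


class QuantConvBlock(nn.Module):
    """One conv + batch-norm + activation block with quantized weights and activations."""

    def __init__(self, in_ch: int, out_ch: int, bit_width: int = 4):
        super().__init__()
        # Quantize incoming activations to the chosen bit width.
        self.inp_quant = QuantIdentity(bit_width=bit_width)
        # Convolution with integer-quantized weights at the chosen bit width.
        self.conv = QuantConv2d(in_ch, out_ch, kernel_size=3, padding=1,
                                bias=False, weight_bit_width=bit_width)
        self.bn = nn.BatchNorm2d(out_ch)
        # Quantized activation, also at the chosen bit width.
        self.act = QuantReLU(bit_width=bit_width)

    def forward(self, x):
        return self.act(self.bn(self.conv(self.inp_quant(x))))


# Example: a 4-bit block, matching the precision of the reported configuration
# (the channel counts here are made up).
block = QuantConvBlock(in_ch=16, out_ch=32, bit_width=4)
```

A stack of such blocks can then be trained with quantization-aware training on WiderFace in the usual PyTorch way; Brevitas learns the scales while keeping weights and activations representable at the target bit width.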
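The replacement of the final Sigmoid with a rescaled HardTanh can be sketched in the same spirit. The abstract does not spell out the exact rescaling, so the version below simply assumes a linear map of HardTanh's [-1, 1] output onto [0, 1], which keeps the detection head in the same output range as a Sigmoid while avoiding the exponential.

```python
import torch
import torch.nn as nn


class RescaledHardTanh(nn.Module):
    """HardTanh linearly rescaled from [-1, 1] to [0, 1]: a cheap,
    FPGA-friendly stand-in for Sigmoid on the detection-head outputs."""

    def __init__(self):
        super().__init__()
        self.hardtanh = nn.Hardtanh(min_val=-1.0, max_val=1.0)

    def forward(self, x):
        return 0.5 * (self.hardtanh(x) + 1.0)


# Both map raw logits into [0, 1]; the piecewise-linear version needs only
# a clamp, a shift and a scale, which is why it is attractive in FPGA logic.
logits = torch.linspace(-3.0, 3.0, steps=7)
print(torch.sigmoid(logits))
print(RescaledHardTanh()(logits))
```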
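Finally, the CPU side of the system (preprocessing, postprocessing and TCP/IP streaming on the SoC's ARM cores) lends itself to a queue-per-stage threading sketch. Everything below is illustrative: the stage functions are placeholders, and run_accelerator is a stub standing in for the real FPGA invocation rather than any API from the paper.

```python
import queue
import threading


# Placeholder stage functions; in the real application these run on the Zynq's ARM cores.
def preprocess(frame):
    return frame                      # e.g. resize and normalise the input image

def run_accelerator(tensor):
    return tensor                     # stub standing in for the FPGA accelerator call

def postprocess(raw_output):
    return raw_output                 # e.g. decode boxes and apply NMS

def stream_over_tcp(detections):
    pass                              # e.g. send results to a connected TCP client


def stage(worker, in_q, out_q=None):
    """Generic pipeline stage: consume items from in_q, process, forward to out_q."""
    while True:
        item = in_q.get()
        if item is None:              # poison pill shuts the stage down and propagates
            if out_q is not None:
                out_q.put(None)
            break
        result = worker(item)
        if out_q is not None:
            out_q.put(result)


frame_q  = queue.Queue(maxsize=4)     # camera frames waiting for preprocessing
infer_q  = queue.Queue(maxsize=4)     # preprocessed tensors waiting for the accelerator
result_q = queue.Queue(maxsize=4)     # accelerator outputs waiting for postprocessing
send_q   = queue.Queue(maxsize=4)     # detections waiting to be streamed

threads = [
    threading.Thread(target=stage, args=(preprocess, frame_q, infer_q)),
    threading.Thread(target=stage, args=(run_accelerator, infer_q, result_q)),
    threading.Thread(target=stage, args=(postprocess, result_q, send_q)),
    threading.Thread(target=stage, args=(stream_over_tcp, send_q)),
]
for t in threads:
    t.start()
```

Bounded queues give simple backpressure between stages, so a slow consumer (for example the TCP link) throttles the producers instead of letting frames pile up in memory.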
Related papers
- Energy-Aware FPGA Implementation of Spiking Neural Network with LIF Neurons [0.5243460995467893]
Spiking Neural Networks (SNNs) stand out as a cutting-edge solution for TinyML.
This paper presents a novel SNN architecture based on the 1st Order Leaky Integrate-and-Fire (LIF) neuron model.
A hardware-friendly LIF design is also proposed, and implemented on a Xilinx Artix-7 FPGA.
arXiv Detail & Related papers (2024-11-03T16:42:10Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Towards Enabling Dynamic Convolution Neural Network Inference for Edge Intelligence [0.0]
Recent advances in edge intelligence require CNN inference on edge networks to increase throughput and reduce latency.
To provide flexibility, dynamic parameter allocation to different mobile devices is required to implement either a predefined CNN architecture or one defined on the fly.
We propose a library-based approach to design scalable and dynamic distributed CNN inference on the fly.
arXiv Detail & Related papers (2022-02-18T22:33:42Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA and uses in total around 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
- Compressing deep neural networks on FPGAs to binary and ternary precision with HLS4ML [13.325670094073383]
We present the implementation of binary and ternary neural networks in the hls4ml library.
We discuss the trade-off between model accuracy and resource consumption.
The binary and ternary implementation has similar performance to the higher precision implementation while using drastically fewer FPGA resources.
arXiv Detail & Related papers (2020-03-11T10:46:51Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)