Flexible and Fully Quantized Ultra-Lightweight TinyissimoYOLO for
Ultra-Low-Power Edge Systems
- URL: http://arxiv.org/abs/2307.05999v2
- Date: Fri, 14 Jul 2023 04:32:40 GMT
- Title: Flexible and Fully Quantized Ultra-Lightweight TinyissimoYOLO for
Ultra-Low-Power Edge Systems
- Authors: Julian Moosmann, Hanna Mueller, Nicky Zimmerman, Georg Rutishauser,
Luca Benini, Michele Magno
- Abstract summary: We deploy variants of TinyissimoYOLO on state-of-the-art ultra-low-power extreme edge platforms.
This paper presents a comparison of latency, energy efficiency, and the platforms' ability to parallelize the workload.
- Score: 13.266626571886354
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper deploys and explores variants of TinyissimoYOLO, a highly flexible
and fully quantized ultra-lightweight object detection network designed for
edge systems with a power envelope of a few milliwatts. With experimental
measurements, we present a comprehensive characterization of the network's
detection performance, exploring the impact of various parameters, including
input resolution, number of object classes, and hidden layer adjustments. We
deploy variants of TinyissimoYOLO on state-of-the-art ultra-low-power extreme
edge platforms, presenting an in-depth comparison of latency, energy
efficiency, and each platform's ability to parallelize the workload. In
particular, the paper presents a comparison between a novel parallel RISC-V
processor (GAP9 from GreenWaves) with and without its on-chip hardware
accelerator, an ARM Cortex-M7 core (STM32H7 from ST Microelectronics), two ARM
Cortex-M4 cores (STM32L4 from STM and Apollo4b from Ambiq), and a multi-core
platform with a CNN hardware accelerator (Analog Devices MAX78000).
Experimental results show that the GAP9's hardware accelerator achieves the
lowest inference latency and energy at 2.12ms and 150uJ respectively, which is
around 2x faster and 20% more efficient than the next best platform, the
MAX78000. The hardware accelerator of GAP9 can even run an increased resolution
version of TinyissimoYOLO with 112x112 pixels and 10 detection classes within
3.2ms, consuming 245uJ. To showcase the competitiveness of a versatile
general-purpose system, we also deployed and profiled a multi-core
implementation on GAP9 at different operating points, achieving 11.3ms in the
lowest-latency configuration and 490uJ in the most energy-efficient one. With
this paper, we demonstrate the suitability and flexibility of TinyissimoYOLO on
state-of-the-art detection datasets for real-time ultra-low-power edge
inference.
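The latency and energy figures above can be cross-checked with a few lines of arithmetic. In the sketch below, the GAP9 numbers (2.12 ms, 150 uJ) are quoted from the abstract, while the MAX78000 values are back-calculated from the stated "around 2x faster and 20% more efficient" claim and are therefore only illustrative, not measured:

```python
# GAP9 figures come from the abstract; MAX78000 figures are derived from
# the claimed ratios (2x latency, 1/0.8x energy) and are illustrative only.
measurements = {
    "GAP9 (HW accelerator)": {"latency_ms": 2.12, "energy_uJ": 150.0},
    "MAX78000 (illustrative)": {"latency_ms": 2.12 * 2, "energy_uJ": 150.0 / 0.8},
}

def average_power_mW(latency_ms: float, energy_uJ: float) -> float:
    """Average power during inference: P = E / t (uJ per ms equals mW)."""
    return energy_uJ / latency_ms

for name, m in measurements.items():
    p = average_power_mW(m["latency_ms"], m["energy_uJ"])
    print(f"{name}: {m['latency_ms']:.2f} ms, {m['energy_uJ']:.0f} uJ, ~{p:.0f} mW avg")
```

Note that average power during active inference is well above the "few milliwatts" envelope; that envelope plausibly refers to duty-cycled average power, which this sketch does not model.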
Related papers
- PowerYOLO: Mixed Precision Model for Hardware Efficient Object Detection with Event Data [0.5461938536945721]
PowerYOLO is a mixed-precision solution to the problem of fitting algorithms with high memory and computational complexity onto small low-power devices.
First, we propose a system based on a Dynamic Vision Sensor (DVS), a novel sensor type with low power requirements.
Second, to ensure high accuracy and low memory and computational complexity, we propose to use 4-bit width Powers-of-Two (PoT) quantisation.
Third, we replace multiplications with bit-shifts to increase the efficiency of hardware acceleration for such a solution.
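The PoT-plus-bit-shift idea can be sketched in a few lines. This is a generic illustration of powers-of-two quantisation with an assumed 4-bit format (1 sign bit, 3 exponent bits, exponent range -4..2), not PowerYOLO's exact scheme:

```python
import math

EXP_MIN, EXP_MAX = -4, 2  # assumed exponent range for a 4-bit PoT format

def quantize_pot(w: float):
    """Snap a weight to +/- 2**k: return (sign, exponent) with w ~= sign * 2**exponent."""
    sign = 1 if w >= 0 else -1
    exp = round(math.log2(abs(w))) if w != 0 else EXP_MIN
    return sign, max(EXP_MIN, min(EXP_MAX, exp))

def pot_multiply(x: int, sign: int, exp: int) -> int:
    """Multiply an integer activation by a PoT weight using only a shift and a sign flip."""
    return sign * (x << exp if exp >= 0 else x >> -exp)

sign, exp = quantize_pot(0.24)          # 0.24 snaps to 2**-2 = 0.25
assert (sign, exp) == (1, -2)
assert pot_multiply(64, sign, exp) == 16  # 64 * 0.25 via a right shift
```

Because every weight is a signed power of two, the multiplier in the MAC unit can be replaced by a barrel shifter, which is the hardware-efficiency argument the abstract makes.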
arXiv Detail & Related papers (2024-07-11T08:17:35Z)
- MODIPHY: Multimodal Obscured Detection for IoT using PHantom Convolution-Enabled Faster YOLO [10.183459286746196]
We introduce YOLO Phantom, one of the smallest YOLO models ever conceived.
YOLO Phantom achieves comparable accuracy to the latest YOLOv8n model while simultaneously reducing both parameters and model size.
Its real-world efficacy is demonstrated on an IoT platform with advanced low-light and RGB cameras, seamlessly connecting to an AWS-based notification endpoint.
arXiv Detail & Related papers (2024-02-12T18:56:53Z)
- Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs [22.293462679874008]
This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices.
It achieves at least 5.4x higher throughput and is 1.37x more energy-efficient than existing approaches.
arXiv Detail & Related papers (2023-10-04T08:42:10Z)
- EpiDeNet: An Energy-Efficient Approach to Seizure Detection for Embedded Systems [9.525786920713763]
This paper introduces EpiDeNet, a new lightweight seizure detection network.
Sensitivity-Specificity Weighted Cross-Entropy (SSWCE) is a new loss function that incorporates sensitivity and specificity.
A three-window majority-voting smoothing scheme, combined with the SSWCE loss, achieves a 3x reduction in false positives, down to 1.18 FP/h.
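The three-window majority-voting smoothing can be sketched directly: a positive prediction survives only if the majority of the last three windows agree, which suppresses isolated false alarms. The window size (3) comes from the abstract; everything else here is an illustrative assumption, not EpiDeNet's exact pipeline:

```python
# Sketch: majority vote over a sliding window of binary seizure predictions.
def smooth_majority(preds, window: int = 3):
    """Keep a positive only when it wins the majority of the trailing window."""
    smoothed = []
    for i in range(len(preds)):
        ctx = preds[max(0, i - window + 1): i + 1]
        smoothed.append(1 if sum(ctx) * 2 > len(ctx) else 0)
    return smoothed

# The isolated positive at index 1 (a likely false alarm) is removed,
# while the sustained detection starting at index 4 survives (delayed by one window).
assert smooth_majority([0, 1, 0, 0, 1, 1, 1, 0]) == [0, 0, 0, 0, 0, 1, 1, 1]
```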
arXiv Detail & Related papers (2023-08-28T11:29:51Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is a recent multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with over 80% computation reduction, but poses challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
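The source of MoE's computation saving is sparse routing: each input is sent to only the top-k of E experts, so the remaining expert FLOPs are skipped entirely. A minimal sketch, with toy experts and gate weights that are illustrative assumptions (not M$^3$ViT's actual configuration):

```python
import math

def moe_forward(x, experts, gate_weights, top_k=1):
    """Route input x to the top-k experts by gate score and mix their outputs."""
    scores = [sum(g * xi for g, xi in zip(gw, x)) for gw in gate_weights]
    top = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]
    mx = max(scores[e] for e in top)
    ws = [math.exp(scores[e] - mx) for e in top]  # softmax over selected experts only
    z = sum(ws)
    # Only the selected experts are evaluated; the others cost nothing.
    return sum((w / z) * experts[e](x) for w, e in zip(ws, top)), top

# Toy setup: 4 experts that scale the input sum by different factors.
experts = [lambda x, s=s: s * sum(x) for s in (0.5, 1.0, 2.0, 4.0)]
gates = [[1, 0], [0, 1], [1, 1], [-1, -1]]
out, chosen = moe_forward([2.0, 3.0], experts, gates, top_k=1)
# Gate scores are [2, 3, 5, -5], so expert 2 wins and 3 of 4 experts are skipped.
assert chosen == [2]
assert out == 10.0
```

With top_k=1 and E=4, roughly 75% of the expert computation is skipped per token, which is the mechanism behind the "over 80% computation reduction" claim above (the exact figure depends on E, k, and expert size).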
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Ultra-low Power Deep Learning-based Monocular Relative Localization Onboard Nano-quadrotors [64.68349896377629]
This work presents a novel autonomous end-to-end system that addresses the monocular relative localization, through deep neural networks (DNNs), of two peer nano-drones.
To cope with the ultra-constrained nano-drone platform, we propose a vertically-integrated framework, including dataset augmentation, quantization, and system optimizations.
Experimental results show that our DNN can precisely localize a 10cm-sized target nano-drone at distances of up to 2m, using only low-resolution monochrome images.
arXiv Detail & Related papers (2023-03-03T14:14:08Z)
- High-Throughput, High-Performance Deep Learning-Driven Light Guide Plate Surface Visual Quality Inspection Tailored for Real-World Manufacturing Environments [75.66288398180525]
Light guide plates are essential optical components widely used in a diverse range of applications ranging from medical lighting fixtures to back-lit TV displays.
In this work, we introduce a fully-integrated, high-performance deep learning-driven workflow for light guide plate surface visual quality inspection (VQI) tailored for real-world manufacturing environments.
To enable automated VQI on the edge computing within the fully-integrated VQI system, a highly compact deep anti-aliased attention condenser neural network (which we name LightDefectNet) was created.
Experiments show that LightDefectNet achieves strong detection accuracy.
arXiv Detail & Related papers (2022-12-20T20:11:11Z)
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
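The core trick behind linear attention is associativity: with a non-negative feature map phi, attention can be computed as phi(Q) @ (phi(K)^T @ V), which is linear rather than quadratic in sequence length. A generic sketch with an assumed ReLU feature map, not EfficientViT's exact multi-scale formulation:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Softmax-free attention: phi(Q) @ (phi(K).T @ V), O(N) in sequence length."""
    phi = lambda t: np.maximum(t, 0.0)           # assumed non-negative feature map (ReLU)
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                # (d, d_v) summary, computed once
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T  # per-query normaliser, shape (N, 1)
    return (Qp @ kv) / (norm + eps)

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The (d, d_v) summary `kv` is shared by all queries, so cost grows with N instead of N^2, which is what makes global receptive fields affordable at high resolution.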
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- Revisiting Multi-Scale Feature Fusion for Semantic Segmentation [90.32746095413447]
In this paper, we demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate semantic segmentation.
We develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions.
Our simple method can achieve better accuracy with faster speed than prior art across multiple datasets.
arXiv Detail & Related papers (2022-03-23T19:14:11Z)
- FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4bit-Compact Multilayer Perceptrons [19.411734658680967]
We propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs).
Our approach is centred around compression as a means for reducing the area as well as power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performances.
We show that we can achieve throughputs of 2.45 TOPS with a total power consumption of 3.6W on a Virtex UltraScale FPGA XCVU440 implementation, and a total power efficiency of 20.17 TOPS/W on a 22nm-process ASIC version.
arXiv Detail & Related papers (2020-12-17T19:10:04Z)
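The efficiency arithmetic in the FantastIC4 entry above is easy to verify: power efficiency (TOPS/W) is throughput divided by power draw. The FPGA figures (2.45 TOPS, 3.6W) come from the abstract; the 20.17 TOPS/W ASIC number is reported there directly:

```python
def tops_per_watt(throughput_tops: float, power_w: float) -> float:
    """Power efficiency: throughput (TOPS) divided by power draw (W)."""
    return throughput_tops / power_w

fpga = tops_per_watt(2.45, 3.6)
print(f"FPGA: {fpga:.2f} TOPS/W; reported ASIC: 20.17 TOPS/W")  # FPGA comes to ~0.68 TOPS/W
```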
This list is automatically generated from the titles and abstracts of the papers in this site.