Evaluation of Convolution Primitives for Embedded Neural Networks on
32-bit Microcontrollers
- URL: http://arxiv.org/abs/2303.10702v1
- Date: Sun, 19 Mar 2023 16:17:19 GMT
- Title: Evaluation of Convolution Primitives for Embedded Neural Networks on
32-bit Microcontrollers
- Authors: Baptiste Nguyen, Pierre-Alain Moellic, Sylvain Blayac
- Abstract summary: We propose an implementation for the ARM Cortex-M processor family with an open-source deployment platform (NNoM).
Our benchmark reveals a linear relationship between theoretical MACs and energy consumption.
We discuss the significant reduction in latency and energy consumption due to the use of SIMD instructions.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deploying neural networks on constrained hardware platforms such as 32-bit
microcontrollers is a challenging task because of the large memory, computing
and energy requirements of their inference process. To tackle these issues,
several convolution primitives have been proposed to make the standard
convolution more computationally efficient. However, few of these primitives
have actually been implemented for 32-bit microcontrollers. In this work, we
collect different state-of-the-art convolution primitives and propose an
implementation for the ARM Cortex-M processor family with an open-source
deployment platform (NNoM). Then, we carry out experimental characterization
tests on these implementations. Our benchmark reveals a linear relationship
between theoretical MACs and energy consumption, showing the advantages of
using computationally efficient primitives such as shift convolution. We
discuss the significant reduction in latency and energy consumption due to the
use of SIMD instructions and highlight the importance of data reuse in those
performance gains. For reproducibility purposes and further experiments, the
code and experiments are publicly available.
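The linear relationship between theoretical MACs and energy consumption, and the advantage of primitives like shift convolution, can be illustrated with a short sketch. The MAC formulas below are the standard theoretical counts; the energy coefficients are hypothetical placeholders, not measured values from the paper.

```python
# Illustrative sketch (not the paper's code): theoretical MAC counts of
# convolution primitives, plus a linear energy model of the form the
# benchmark reports (energy ~ a * MACs + b).

def standard_conv_macs(h, w, c_in, c_out, k):
    """Theoretical MACs of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k pass followed by a 1x1 pointwise pass."""
    return h * w * c_in * k * k + h * w * c_in * c_out

def shift_conv_macs(h, w, c_in, c_out):
    """Shift convolution: spatial shifts cost no MACs, only the
    1x1 pointwise convolution does."""
    return h * w * c_in * c_out

h = w = 32
c_in = c_out = 64
k = 3
std = standard_conv_macs(h, w, c_in, c_out, k)
sep = depthwise_separable_macs(h, w, c_in, c_out, k)
shift = shift_conv_macs(h, w, c_in, c_out)
print(f"standard:            {std:,} MACs")
print(f"depthwise separable: {sep:,} MACs ({std / sep:.1f}x fewer)")
print(f"shift:               {shift:,} MACs ({std / shift:.1f}x fewer)")

# Under a linear energy model E = a * MACs + b, with coefficients fit from
# measurements on the target board (the values below are made up), the
# energy ratio between primitives tracks the MAC ratio for large layers.
a, b = 2.0e-9, 1.0e-6  # hypothetical J/MAC and fixed overhead in J
for name, macs in [("standard", std), ("shift", shift)]:
    print(f"{name}: ~{a * macs + b:.2e} J")
```

For a 3x3 layer, the shift primitive reduces the theoretical MAC count by a factor of k*k = 9, which under the linear model translates almost directly into energy savings.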
Related papers
- Accelerating TinyML Inference on Microcontrollers through Approximate Kernels [3.566060656925169]
In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on microcontrollers.
Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our solutions achieve an average latency reduction of 21%.
arXiv Detail & Related papers (2024-09-25T11:10:33Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Human Activity Recognition on Microcontrollers with Quantized and Adaptive Deep Neural Networks [10.195581493173643]
Human Activity Recognition (HAR) based on inertial data is an increasingly widespread task on embedded devices.
Most embedded HAR systems are based on simple and relatively inaccurate classic machine learning algorithms.
This work proposes a set of efficient one-dimensional Convolutional Neural Networks (CNNs) deployable on general-purpose microcontrollers (MCUs).
arXiv Detail & Related papers (2022-09-02T06:32:11Z)
- Keyword Spotting System and Evaluation of Pruning and Quantization Methods on Low-power Edge Microcontrollers [7.570300579676175]
Keyword spotting (KWS) is beneficial for voice-based user interactions with low-power devices at the edge.
This paper shows our small-footprint KWS system running on an STM32F7 microcontroller with a Cortex-M7 core at 216 MHz and 512 KB of static RAM.
arXiv Detail & Related papers (2022-08-04T16:49:45Z)
- MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods rarely translate into actual inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Quantization and Deployment of Deep Neural Networks on Microcontrollers [0.0]
This work focuses on quantization and deployment of deep neural networks onto low-power 32-bit microcontrollers.
A new framework for end-to-end deep neural networks training, quantization and deployment is presented.
Execution using single-precision 32-bit floating point as well as fixed point on 8- and 16-bit integers is supported.
arXiv Detail & Related papers (2021-05-27T17:39:06Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- Efficient Neural Network Deployment for Microcontroller [0.0]
This paper explores and generalizes convolutional neural network deployment for microcontrollers.
Memory savings and performance are compared with the CMSIS-NN framework developed for ARM Cortex-M CPUs.
The final purpose is to develop a tool that consumes a PyTorch model with trained network weights and turns it into an optimized C/C++ inference engine for microcontrollers with low memory (kilobyte level) and limited computing capability.
arXiv Detail & Related papers (2020-07-02T19:21:05Z)
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
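Several of the papers above rely on fixed-point quantization to fit networks onto 32-bit MCUs. A minimal sketch of symmetric per-tensor 8-bit quantization is given below; it is illustrative only and is not the exact scheme of any of the listed papers.

```python
# Illustrative sketch: symmetric per-tensor quantization to signed 8-bit
# integers, and the matching dequantization. Not any paper's exact scheme.

def quantize(values, num_bits=8):
    """Map floats to signed num_bits integers using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid div-by-zero on all-zeros
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer representation."""
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, s = quantize(weights)
print(q)                 # integers in [-128, 127]
print(dequantize(q, s))  # close to the original floats
```

The quantization error is bounded by half the scale factor per value, which is why 8-bit inference typically loses little accuracy while quartering weight storage compared to 32-bit floats.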
This list is automatically generated from the titles and abstracts of the papers in this site.