Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers
- URL: http://arxiv.org/abs/2509.17533v1
- Date: Mon, 22 Sep 2025 08:52:54 GMT
- Title: Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers
- Authors: Anastasios Fanariotis, Theofanis Orphanoudakis, Vasilis Fotopoulos
- Abstract summary: This paper evaluates the impact of Neural Processing Units (NPUs) on machine learning (ML) execution on microcontrollers (MCUs). It shows substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7x to over 125x, with per-inference net energy reductions of up to 143x.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The deployment of machine learning (ML) models on microcontrollers (MCUs) is constrained by strict energy, latency, and memory requirements, particularly in battery-operated and real-time edge devices. While software-level optimizations such as quantization and pruning reduce model size and computation, hardware acceleration has emerged as a decisive enabler for efficient embedded inference. This paper evaluates the impact of Neural Processing Units (NPUs) on MCU-based ML execution, using the ARM Cortex-M55 core combined with the Ethos-U55 NPU on the Alif Semiconductor Ensemble E7 development board as a representative platform. A rigorous measurement methodology was employed, incorporating per-inference net energy accounting via GPIO-triggered high-resolution digital multimeter synchronization and idle-state subtraction, ensuring accurate attribution of energy costs. Experimental results across six representative ML models -including MiniResNet, MobileNetV2, FD-MobileNet, MNIST, TinyYolo, and SSD-MobileNet- demonstrate substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7x to over 125x, with per-inference net energy reductions up to 143x. Notably, the NPU enabled execution of models unsupported on CPU-only paths, such as SSD-MobileNet, highlighting its functional as well as efficiency advantages. These findings establish NPUs as a cornerstone of energy-aware embedded AI, enabling real-time, power-constrained ML inference at the MCU level.
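As a rough illustration of the measurement protocol described above (GPIO-triggered multimeter synchronization with idle-state subtraction), the sketch below shows the firmware-side trigger pattern and the host-side net-energy formula. The HAL hooks, pin number, and run_inference() wrapper are hypothetical placeholders, not the authors' code.

```c
#include <stdint.h>

/* Hypothetical HAL hooks supplied by the board support package. */
extern void gpio_set(uint32_t pin);    /* drive the trigger pin high */
extern void gpio_clear(uint32_t pin);  /* drive the trigger pin low  */
extern int  run_inference(void);       /* wrapper around one CPU- or NPU-side inference */

#define TRIGGER_PIN 4u  /* assumed GPIO routed to the DMM's external-trigger input */

/* Bracket exactly one inference so the external meter samples only that window. */
int timed_inference(void)
{
    gpio_set(TRIGGER_PIN);        /* rising edge: DMM starts high-resolution sampling */
    int status = run_inference(); /* single inference on the Cortex-M55 or Ethos-U55  */
    gpio_clear(TRIGGER_PIN);      /* falling edge: DMM stops sampling                 */
    return status;
}

/*
 * Host-side idle-state subtraction (per-inference net energy):
 *   E_net = E_window - P_idle * t_window
 * where E_window is the energy integrated over the triggered window,
 * P_idle is the separately measured idle power, and t_window is the window duration.
 */
```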
Related papers
- NanoCockpit: Performance-optimized Application Framework for AI-based Autonomous Nanorobotics [50.594459728605734]
Small form factor, i.e., a few tens of grams, severely limits onboard computational resources to sub-100-milliwatt microcontroller units (MCUs). Our framework achieves ideal end-to-end latency, i.e., zero overhead due to serialized tasks, delivering quantifiable improvements in closed-loop control performance.
arXiv Detail & Related papers (2026-01-12T12:29:38Z) - Efficient Deployment of CNN Models on Multiple In-Memory Computing Units [0.0]
In-Memory Computing (IMC) represents a paradigm shift in deep learning acceleration. We introduce the Load-Balance-Longest-Path (LBLP) algorithm, which maximizes the processing rate and minimizes latency through efficient resource utilization.
arXiv Detail & Related papers (2025-10-09T14:03:32Z) - End-to-End Efficiency in Keyword Spotting: A System-Level Approach for Embedded Microcontrollers [0.18472148461613155]
Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled devices. In this work, we evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs). Our results show that TKWS with three residual blocks achieves up to a 92.4% F1-score with only 14.4k parameters.
arXiv Detail & Related papers (2025-09-08T16:01:55Z) - Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing [67.98609858326951]
Intra-DP is a high-performance collaborative inference system optimized for deep neural networks (DNNs) on mobile devices. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-08T09:50:57Z) - Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI [0.0]
This work introduces an alternative benchmarking methodology that integrates energy and latency measurements. To evaluate our setup, we tested the STM32N6 MCU, which includes an NPU for executing neural networks. Our findings demonstrate that reducing the core voltage and clock frequency improves the efficiency of pre- and post-processing.
arXiv Detail & Related papers (2025-05-21T15:12:14Z) - On-Sensor Convolutional Neural Networks with Early-Exits [3.916521228619074]
We introduce, for the first time in the literature, the optimized design and implementation of Depth-First CNNs operating on the Intelligent Sensor Processing Unit (ISPU) within an Inertial Measurement Unit (IMU) by STMicroelectronics. Our approach partitions the CNN between the ISPU and the microcontroller (MCU) and employs an Early-Exit mechanism to stop the computation on the IMU when sufficient confidence in the result is reached.
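A minimal sketch of that early-exit idea is shown below, assuming the confidence test is a top-1 softmax threshold; exit_head_scores(), forward_to_mcu(), the class count, and the threshold value are illustrative assumptions, not the paper's implementation.

```c
#define NUM_CLASSES 5
#define CONFIDENCE_THRESHOLD 0.90f

/* Hypothetical hooks: softmax output of the on-ISPU exit head, and the MCU hand-off. */
extern void exit_head_scores(float probs[NUM_CLASSES]);
extern void forward_to_mcu(void);

/* Returns the predicted class if the ISPU is confident, or -1 if the MCU must finish. */
int classify_with_early_exit(void)
{
    float probs[NUM_CLASSES];
    exit_head_scores(probs);

    /* Take the top-1 probability as the confidence estimate. */
    int best = 0;
    for (int i = 1; i < NUM_CLASSES; i++) {
        if (probs[i] > probs[best]) best = i;
    }

    if (probs[best] >= CONFIDENCE_THRESHOLD) {
        return best;        /* confident enough: stop on the IMU, skip the MCU */
    }
    forward_to_mcu();       /* otherwise let the MCU run the remaining layers  */
    return -1;
}
```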
arXiv Detail & Related papers (2025-03-21T08:31:07Z) - Energy-Aware FPGA Implementation of Spiking Neural Network with LIF Neurons [0.5243460995467893]
Spiking Neural Networks (SNNs) stand out as a cutting-edge solution for TinyML.
This paper presents a novel SNN architecture based on the first-order Leaky Integrate-and-Fire (LIF) neuron model.
A hardware-friendly LIF design is also proposed and implemented on a Xilinx Artix-7 FPGA.
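For reference, the first-order LIF update behind such designs can be written in a few lines of C; the leak factor and threshold below are illustrative values, not the paper's parameters.

```c
typedef struct {
    float v;          /* membrane potential                     */
    float beta;       /* leak factor per timestep, 0 < beta < 1 */
    float threshold;  /* firing threshold                       */
} lif_neuron_t;

/* One timestep of a first-order LIF neuron: leaky integration, threshold, reset.
 * Returns 1 if the neuron emits a spike at this step, 0 otherwise. */
static int lif_step(lif_neuron_t *n, float input_current)
{
    n->v = n->beta * n->v + input_current;  /* leaky integration      */
    if (n->v >= n->threshold) {
        n->v = 0.0f;                        /* reset-to-zero on spike */
        return 1;
    }
    return 0;
}
```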
arXiv Detail & Related papers (2024-11-03T16:42:10Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
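The lookup-table idea can be illustrated generically: with 2-bit weights there are only four possible weight values, so products can be read from a tiny table instead of multiplied. The sketch below is a scalar illustration of that principle under assumed unsigned 2-bit weight codes, not DeepGEMM's SIMD kernel.

```c
#include <stdint.h>

#define LEN 64  /* illustrative vector length */

/* Dot product with 2-bit weight codes (0..3), stored one per byte for clarity.
 * Each activation's four possible products form a tiny LUT, so the inner loop
 * does table reads instead of multiplies; real SIMD kernels build such tables
 * once per activation block and reuse them across many weight rows. */
int32_t lut_dot(const uint8_t w2[LEN], const int8_t x[LEN])
{
    int32_t acc = 0;
    for (int i = 0; i < LEN; i++) {
        const int32_t lut[4] = { 0, x[i], 2 * x[i], 3 * x[i] };
        acc += lut[w2[i] & 0x3];  /* multiply replaced by a lookup */
    }
    return acc;
}
```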
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) algorithm, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits through soft policy iterations.
With a latency- and accuracy-aware reward design, such a scheme adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet
Implementation for Edge Motor-Imagery Brain--Machine Interfaces [16.381467082472515]
Motor-Imagery Brain--Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines.
Deep learning models have emerged for classifying EEG signals.
These models often exceed the limitations of edge devices due to their memory and computational requirements.
arXiv Detail & Related papers (2020-04-24T12:29:03Z)