Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical
- URL: http://arxiv.org/abs/2407.11061v1
- Date: Wed, 10 Jul 2024 16:05:43 GMT
- Title: Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical
- Authors: Adarsh Prasad Behera, Paulius Daubaris, Iñaki Bravo, José Gallego, Roberto Morabito, Joerg Widmer, Jaya Prakash Varma Champati
- Abstract summary: Hierarchical Inference (HI) system offloads selected samples to an edge server or cloud for remote ML inference.
This paper systematically compares the performance of HI with on-device inference based on measurements of accuracy, latency, and energy.
- Score: 4.211747495359569
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: On-device inference holds great potential for increased energy efficiency, responsiveness, and privacy in edge ML systems. However, due to less capable ML models that can be embedded in resource-limited devices, use cases are limited to simple inference tasks such as visual keyword spotting, gesture recognition, and predictive analytics. In this context, the Hierarchical Inference (HI) system has emerged as a promising solution that augments the capabilities of the local ML by offloading selected samples to an edge server or cloud for remote ML inference. Existing works demonstrate through simulation that HI improves accuracy. However, they do not account for the latency and energy consumption on the device, nor do they consider three key heterogeneous dimensions that characterize ML systems: hardware, network connectivity, and models. In contrast, this paper systematically compares the performance of HI with on-device inference based on measurements of accuracy, latency, and energy for running embedded ML models on five devices with different capabilities and three image classification datasets. For a given accuracy requirement, the HI systems we designed achieved up to 73% lower latency and up to 77% lower device energy consumption than an on-device inference system. The key to building an efficient HI system is the availability of small-size, reasonably accurate on-device models whose outputs can be effectively differentiated for samples that require remote inference. Despite the performance gains, HI requires on-device inference for all samples, which adds a fixed overhead to its latency and energy consumption. Therefore, we design a hybrid system, Early Exit with HI (EE-HI), and demonstrate that compared to HI, EE-HI reduces the latency by up to 59.7% and lowers the device's energy consumption by up to 60.4%.
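The abstract's key point is that HI runs the small on-device model on every sample and offloads only those whose outputs suggest the local prediction is unreliable. A minimal sketch of one common such rule, a softmax confidence threshold, is given below; this rule, the `hi_decision` function, and the 0.8 threshold are illustrative assumptions, not the paper's actual offloading criterion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hi_decision(local_logits, threshold=0.8):
    """Illustrative Hierarchical Inference decision rule.

    Accept the on-device prediction when its top softmax
    probability clears the threshold; otherwise mark the sample
    for offloading to an edge server or cloud for remote
    inference. Returns ("local", label) or ("offload", None).
    """
    probs = softmax(local_logits)
    confidence = max(probs)
    if confidence >= threshold:
        return ("local", probs.index(confidence))
    return ("offload", None)
```

Under such a rule, every sample still pays the on-device inference cost, which is the fixed overhead the paper's EE-HI variant reduces by exiting early for easy samples.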
Related papers
- Dynamic Meta-Ensemble Framework for Efficient and Accurate Deep Learning in Plant Leaf Disease Detection on Resource-Constrained Edge Devices [0.0]
We introduce a novel Dynamic Meta-Ensemble Framework (DMEF) for high-accuracy plant disease diagnosis under resource constraints.
DMEF employs an adaptive weighting mechanism that dynamically combines the predictions of three lightweight convolutional neural networks.
Experiments on benchmark datasets for potato and maize diseases demonstrate state-of-the-art classification accuracies of 99.53% and 96.61%, respectively.
arXiv Detail & Related papers (2026-01-24T03:57:49Z) - End-to-End Efficiency in Keyword Spotting: A System-Level Approach for Embedded Microcontrollers [0.18472148461613155]
Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled devices.
In this work, we evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS (TKWS) architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs).
Our results show that TKWS with three residual blocks achieves up to a 92.4% F1-score with only 14.4k parameters.
arXiv Detail & Related papers (2025-09-08T16:01:55Z) - Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI [0.0]
This work introduces an alternative benchmarking methodology that integrates energy and latency measurements.
To evaluate our setup, we tested the STM32N6 MCU, which includes an NPU for executing neural networks.
Our findings demonstrate that reducing the core voltage and clock frequency improves the efficiency of pre- and post-processing.
arXiv Detail & Related papers (2025-05-21T15:12:14Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs.
We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications.
In this paper, we investigate an LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT [0.0]
This work advances real-time, energy-efficient crop monitoring in precision agriculture.
It demonstrates how we can attain ViT-level diagnostic precision on edge devices.
arXiv Detail & Related papers (2025-04-21T06:56:41Z) - Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [49.77734021302196]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.
To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.
Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z) - CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention [53.539020807256904]
We introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO).
Our tokenization scheme represents EEG signals as per-channel patches.
We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving 2x speed improvement with 6x less memory required compared to standard self-attention.
arXiv Detail & Related papers (2025-01-18T21:44:38Z) - Enhancing Predictive Maintenance in Mining Mobile Machinery through a TinyML-enabled Hierarchical Inference Network [0.0]
This paper introduces the Edge Sensor Network for Predictive Maintenance (ESN-PdM).
ESN-PdM is a hierarchical inference framework across edge devices, gateways, and cloud services for real-time condition monitoring.
The system dynamically adjusts the inference location (on-device, on-gateway, or on-cloud) based on trade-offs among accuracy, latency, and battery life.
arXiv Detail & Related papers (2024-11-11T17:48:04Z) - DSORT-MCU: Detecting Small Objects in Real-Time on Microcontroller Units [1.4447019135112429]
This paper proposes an adaptive tiling method for lightweight and energy-efficient object detection networks, including YOLO-based models and the popular FOMO network.
The proposed tiling enables object detection on low-power MCUs with no compromise on accuracy compared to large-scale detection models.
arXiv Detail & Related papers (2024-10-22T07:37:47Z) - Efficient Federated Intrusion Detection in 5G ecosystem using optimized BERT-based model [0.7100520098029439]
5G offers advanced services, supporting applications such as intelligent transportation, connected healthcare, and smart cities within the Internet of Things (IoT).
These advancements introduce significant security challenges, with increasingly sophisticated cyber-attacks.
This paper proposes a robust intrusion detection system (IDS) using federated learning and large language models (LLMs).
arXiv Detail & Related papers (2024-09-28T15:56:28Z) - Comparison of edge computing methods in Internet of Things architectures for efficient estimation of indoor environmental parameters with Machine Learning [0.0]
Two methods are proposed to implement lightweight Machine Learning models that estimate indoor environmental quality (IEQ) parameters.
Their implementation is based on centralised and distributed parallel IoT architectures, connected via wireless.
The training and testing of ML models is accomplished with experiments focused on small temperature and illuminance datasets.
arXiv Detail & Related papers (2024-02-07T21:15:18Z) - EdgeYOLO: An Edge-Real-Time Object Detector [69.41688769991482]
This paper proposes an efficient, low-complexity and anchor-free object detector based on the state-of-the-art YOLO framework.
We develop an enhanced data augmentation method to effectively suppress overfitting during training, and design a hybrid random loss function to improve the detection accuracy of small objects.
Our baseline model reaches 50.6% AP50:95 and 69.8% AP50 on the MS COCO 2017 dataset and 26.4% AP50:95 and 44.8% AP50 on the VisDrone 2019-DET dataset, while meeting real-time requirements (FPS >= 30) on an Nvidia edge-computing device.
arXiv Detail & Related papers (2023-02-15T06:05:14Z) - A lightweight and accurate YOLO-like network for small target detection in Aerial Imagery [94.78943497436492]
We present YOLO-S, a simple, fast and efficient network for small target detection.
YOLO-S exploits a small feature extractor based on Darknet20, as well as skip connection, via both bypass and concatenation.
YOLO-S has an 87% smaller parameter size and almost half the FLOPs of YOLOv3, making deployment practical for low-power industrial applications.
arXiv Detail & Related papers (2022-04-05T16:29:49Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - Energy-Efficient Model Compression and Splitting for Collaborative Inference Over Time-Varying Channels [52.60092598312894]
We propose a technique to reduce the total energy bill at the edge device by utilizing model compression and time-varying model split between the edge and remote nodes.
Our proposed solution results in minimal energy consumption and CO2 emissions compared to the considered baselines.
arXiv Detail & Related papers (2021-06-02T07:36:27Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - Moving Object Classification with a Sub-6 GHz Massive MIMO Array using Real Data [64.48836187884325]
Classification between different activities in an indoor environment using wireless signals is an emerging technology for various applications.
In this paper, we analyze classification of moving objects by employing machine learning on real data from a massive multi-input-multi-output (MIMO) system in an indoor environment.
arXiv Detail & Related papers (2021-02-09T15:48:35Z) - Gait Recovery System for Parkinson's Disease using Machine Learning on Embedded Platforms [0.052498055901649014]
Freezing of Gait (FoG) is a common gait deficit among patients diagnosed with Parkinson's Disease (PD).
The authors propose a ubiquitous embedded system that detects FoG events with a Machine Learning subsystem from accelerometer signals.
arXiv Detail & Related papers (2020-04-13T08:03:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.