An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes
- URL: http://arxiv.org/abs/2507.16594v1
- Date: Tue, 22 Jul 2025 13:50:12 GMT
- Title: An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes
- Authors: Zied Jenhani, Mounir Bensalem, Jasenka Dizdarević, Admela Jukan
- Abstract summary: Split learning (SL) executes part of the inference process on the sensor and offloads the remainder to a companion device. This paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards. We benchmark the performance of a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates.
- Score: 2.624902795082451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Running deep learning inference directly on ultra-low-power edge/IoT nodes has been limited by the tight memory and compute budgets of microcontrollers. Split learning (SL) addresses this limitation by executing part of the inference process on the sensor and offloading the remainder to a companion device. In the context of constrained devices and the related impact of low-power, over-the-air transport protocols, the performance of split learning remains largely unexplored. To the best of our knowledge, this paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards, designed to benchmark the over-the-air performance of split-learning TinyML in edge/IoT environments. We benchmark the performance of a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates. The intermediate activations are exchanged over different wireless communication methods: ESP-NOW, BLE, and traditional UDP/IP and TCP/IP, enabling a head-to-head comparison on identical hardware. Measurements show that splitting the model after the block_16_project_BN layer generates a 5.66 kB tensor that traverses the link in 3.2 ms when UDP is used, achieving a steady-state round-trip latency of 5.8 s. ESP-NOW delivers the most favorable round-trip latency at 3.7 s; BLE extends battery life further but pushes latency beyond 10 s.
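The split described above can be sketched host-side in a few lines of Python: cut MobileNetV2 at the block_16_project_BN layer, quantize the intermediate activation to int8, and push it over UDP. This is an illustration only; the actual testbed runs int8 TFLite Micro models on ESP32-S3 firmware, and the peer address, input frame, and min-max quantization below are placeholder assumptions.

```python
# Minimal sketch, assuming TF/Keras layer naming; the paper's firmware uses
# int8 TFLite Micro models on ESP32-S3, not this host-side Python path.
import socket

import numpy as np
import tensorflow as tf

SPLIT_LAYER = "block_16_project_BN"  # split point reported in the abstract
PEER = ("192.168.4.1", 5005)         # hypothetical companion-device address

full = tf.keras.applications.MobileNetV2(weights=None)  # untrained stand-in
# The "head" (everything up to the split point) runs on the sensor node;
# the companion device would own the remaining layers.
head = tf.keras.Model(full.input, full.get_layer(SPLIT_LAYER).output)

frame = np.random.rand(1, 224, 224, 3).astype(np.float32)  # stand-in image
act = head(frame).numpy()

# Naive min-max int8 quantization of the activation before transmission;
# the real testbed uses the quantization parameters baked into the model.
scale = np.abs(act).max() / 127.0
act_q = np.clip(np.round(act / scale), -128, 127).astype(np.int8)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(act_q.tobytes(), PEER)   # IP fragmentation handles >MTU payloads
print(f"sent {act_q.nbytes} bytes of activation from {SPLIT_LAYER}")
```

At the default 224x224 input, the int8 tensor at this layer is 7x7x320 = 15,680 bytes; the smaller 5.66 kB figure in the abstract presumably reflects the reduced input resolution or width multiplier used on the device.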
Related papers
- How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings [106.3726679697804]
We compare the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE).
FFEs are seen as the standard for low-dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail.
We prove that MPEs improve a network's performance through the structure of their grid and not their learnable embedding.
arXiv Detail & Related papers (2025-04-18T02:18:08Z)
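For the comparison above, here is a minimal numpy sketch of the two encoding families, assuming their standard formulations (random Fourier features, and a 1-D multigrid of learnable lookup tables with linear interpolation); all dimensions and hyperparameters are illustrative.

```python
# Sketch of the two encodings compared above, under standard formulations.
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(x, num_freqs=16, sigma=10.0):
    """FFE: project coords with random frequencies B, then take sin/cos."""
    B = rng.normal(0.0, sigma, size=(x.shape[-1], num_freqs))
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def multigrid_encoding(x, levels=(4, 16, 64), feat_dim=2):
    """MPE: look up interpolated features in tables at several resolutions."""
    feats = []
    for res in levels:
        table = rng.normal(size=(res + 1, feat_dim))  # learnable in practice
        pos = np.clip(x[..., 0], 0.0, 1.0) * res      # map coord to grid
        lo = np.floor(pos).astype(int)
        w = (pos - lo)[..., None]                     # linear-interp weight
        feats.append((1 - w) * table[lo] + w * table[np.minimum(lo + 1, res)])
    return np.concatenate(feats, axis=-1)

x = rng.uniform(0, 1, size=(8, 1))   # 1-D coordinates in [0, 1]
print(fourier_features(x).shape, multigrid_encoding(x).shape)
```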
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices [36.714057078457195]
We present TPI-LLM, a compute- and memory-efficient tensor parallel inference system for 70B-scale models.
TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler.
We show that TPI-LLM achieves over 80% lower time-to-first-token and token latency than Accelerate.
arXiv Detail & Related papers (2024-10-01T09:18:56Z)
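The sliding-window idea in the TPI-LLM summary above can be sketched as a scheduler that keeps only a few layers' weights resident and evicts the oldest as the forward pass advances; the shapes and the synchronous load below are simplifying assumptions, not TPI-LLM's actual implementation.

```python
# Sketch of a sliding-window weight scheduler: only `window` layers are
# resident at once, the rest are paged in from flash/disk on demand.
from collections import OrderedDict

import numpy as np

class SlidingWindowScheduler:
    """Keep at most `window` layers' weights in RAM; evict the oldest."""

    def __init__(self, num_layers, window=4):
        self.num_layers = num_layers
        self.window = window
        self.resident = OrderedDict()           # layer_id -> weight tensor

    def _load(self, i):
        # Stand-in for reading layer i's tensor shards from storage.
        return np.zeros((1024, 1024), dtype=np.float16)

    def get(self, i):
        if i not in self.resident:
            self.resident[i] = self._load(i)
        while len(self.resident) > self.window:
            self.resident.popitem(last=False)   # drop least recently loaded
        return self.resident[i]

sched = SlidingWindowScheduler(num_layers=80, window=4)
for layer in range(sched.num_layers):   # a forward pass touches layers in order
    weights = sched.get(layer)          # peak memory stays at ~4 layers
```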
- Fostc3net: A Lightweight YOLOv5 Based On the Network Structure Optimization [11.969138981034247]
This paper presents an enhanced lightweight YOLOv5 technique customized for mobile devices.
The proposed model achieves a 1% increase in detection accuracy, a 13% reduction in FLOPs, and a 26% decrease in model parameters compared to the existing YOLOv5.
arXiv Detail & Related papers (2024-03-20T16:07:04Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
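Of the three paradigms named in the LAUDNet entry above, layer skipping is the easiest to sketch: a tiny per-sample gate decides whether a residual block runs at all. The gate design below is a generic assumption, not LAUDNet's actual module, and a deployed kernel would execute the block only for the kept samples rather than masking.

```python
# Minimal PyTorch sketch of dynamic layer skipping with a per-sample gate.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Gate: global pool -> linear -> one keep/skip score per sample.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, x):
        keep = (torch.sigmoid(self.gate(x)) > 0.5).float().view(-1, 1, 1, 1)
        # For clarity the block is computed and masked; a real implementation
        # would batch-index so skipped samples never execute the block.
        return x + keep * self.body(x)

block = SkippableBlock(64)
y = block(torch.randn(2, 64, 56, 56))   # skipped samples pass through as-is
```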
- Federated Split Learning with Only Positive Labels for resource-constrained IoT environment [4.055662817794178]
Distributed collaborative machine learning (DCML) is a promising method in the Internet of Things (IoT) domain for training deep learning models.
Splitfed learning (SFL) is the most suitable approach for efficient training and testing when devices have limited computational capabilities.
We show that SFPL outperforms SFL when resource-constrained IoT devices have only positive labeled data.
arXiv Detail & Related papers (2023-07-25T05:33:06Z)
- DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search over quantization bit-widths with training, which makes the procedure complicated and less transparent.
We propose a simple and effective method to compress DNN training.
Experiment results show DIVISION has better comprehensive performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z)
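The activation-compression idea in the DIVISION entry can be sketched with a custom autograd function that computes the forward pass in full precision but saves only a low-bit copy of the activation for backward. The 8-bit min-max scheme below is a generic stand-in, not DIVISION's actual dual-precision decomposition.

```python
# Sketch of activation-compressed training: backward reads an 8-bit copy.
import torch

class LowBitSquare(torch.autograd.Function):
    """y = x**2, but backward uses an 8-bit saved copy of x."""

    @staticmethod
    def forward(ctx, x):
        scale = (x.max() - x.min()).clamp(min=1e-8) / 255.0
        zero = x.min()
        q = ((x - zero) / scale).round().to(torch.uint8)  # 4x smaller than fp32
        ctx.save_for_backward(q, scale, zero)
        return x * x                       # forward stays full precision

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, zero = ctx.saved_tensors
        x_hat = q.float() * scale + zero   # approximate activation
        return 2.0 * x_hat * grad_out      # d(x^2)/dx with the 8-bit copy

x = torch.randn(32, 256, requires_grad=True)
LowBitSquare.apply(x).sum().backward()
print(x.grad.shape)                        # gradients from compressed state
```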
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a novel, greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, such computation offloading can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
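The SAC-d entry above reduces to a discrete action space (an exit point plus a compression bit-width) scored by a latency- and accuracy-aware reward; a minimal sketch of that interface follows, with all values and the trade-off weight as placeholder assumptions (a learned soft policy would replace the random choice).

```python
# Sketch of the discrete action space and reward shape described above.
import itertools
import random

EXIT_POINTS = [4, 8, 12]     # candidate early-exit layers (assumed)
COMPRESS_BITS = [2, 4, 8]    # bit-widths for the offloaded feature (assumed)
ACTIONS = list(itertools.product(EXIT_POINTS, COMPRESS_BITS))

def reward(accuracy, latency_s, lam=0.5):
    # Higher accuracy is rewarded, latency penalized; lam trades them off.
    return accuracy - lam * latency_s

# One (random) policy step; SAC-d would sample from a learned soft policy.
exit_pt, bits = random.choice(ACTIONS)
acc, lat = 0.82, 0.14        # would be measured on the device/channel
print(exit_pt, bits, reward(acc, lat))
```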
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64 MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
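Latent replay, as used in the platform above, stores activations from a frozen feature extractor rather than raw images and rehearses them while only the head trains; the buffer policy, shapes, and the omitted int8 quantization of the latents below are illustrative assumptions.

```python
# Sketch of latent replay: rehearse stored latents while training the head.
import random

import torch
import torch.nn as nn

# Frozen feature extractor (the quantized backbone on-device).
frozen = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(4), nn.Flatten())
head = nn.Linear(16 * 4 * 4, 10)          # only this part keeps training
opt = torch.optim.SGD(head.parameters(), lr=0.01)
replay = []                               # buffer of (latent, label) pairs

def learn(batch_x, batch_y, buffer_cap=128, replay_k=8):
    with torch.no_grad():
        z = frozen(batch_x)               # latents; int8 storage would go here
    for zi, yi in zip(z, batch_y):        # naive fill-until-full buffer policy
        if len(replay) < buffer_cap:
            replay.append((zi, yi))
    old = random.sample(replay, min(replay_k, len(replay)))
    zs = torch.stack([z_ for z_, _ in old] + list(z))
    ys = torch.stack([y_ for _, y_ in old] + list(batch_y))
    loss = nn.functional.cross_entropy(head(zs), ys)   # old + new together
    opt.zero_grad(); loss.backward(); opt.step()

learn(torch.randn(4, 3, 8, 8), torch.randint(0, 10, (4,)))
```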
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have large numbers of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)