Real-Time Performance Benchmarking of TinyML Models in Embedded Systems (PICO: Performance of Inference, CPU, and Operations)
- URL: http://arxiv.org/abs/2509.04721v1
- Date: Fri, 05 Sep 2025 00:30:39 GMT
- Title: Real-Time Performance Benchmarking of TinyML Models in Embedded Systems (PICO: Performance of Inference, CPU, and Operations)
- Authors: Abhishek Dey, Saurabh Srivastava, Gaurav Singh, Robert G. Pettit
- Abstract summary: PICO-TINYML-BENCHMARK is a framework for benchmarking the real-time performance of TinyML models on resource-constrained embedded systems. We benchmark three representative TinyML models on two widely adopted platforms, BeagleBone AI64 and Raspberry Pi 4. Results reveal critical trade-offs: the BeagleBone AI64 demonstrates consistent inference latency for AI-specific tasks, while the Raspberry Pi 4 excels in resource efficiency and cost-effectiveness.
- Score: 5.637804042390397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents PICO-TINYML-BENCHMARK, a modular and platform-agnostic framework for benchmarking the real-time performance of TinyML models on resource-constrained embedded systems. Evaluating key metrics such as inference latency, CPU utilization, memory efficiency, and prediction stability, the framework provides insights into computational trade-offs and platform-specific optimizations. We benchmark three representative TinyML models -- Gesture Classification, Keyword Spotting, and MobileNet V2 -- on two widely adopted platforms, BeagleBone AI64 and Raspberry Pi 4, using real-world datasets. Results reveal critical trade-offs: the BeagleBone AI64 demonstrates consistent inference latency for AI-specific tasks, while the Raspberry Pi 4 excels in resource efficiency and cost-effectiveness. These findings offer actionable guidance for optimizing TinyML deployments, bridging the gap between theoretical advancements and practical applications in embedded systems.
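The listing does not include the framework's measurement code; as a rough illustration of how the reported metrics could be collected on a board such as the Raspberry Pi 4, the sketch below times per-inference latency and samples process-level CPU and memory. The use of `tflite_runtime` and `psutil`, and the stability proxy at the end, are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of a PICO-style measurement loop (assumptions: a TFLite
# model plus the tflite_runtime and psutil packages; not the authors' code).
import statistics
import time

import numpy as np
import psutil
from tflite_runtime.interpreter import Interpreter

def benchmark(model_path: str, samples: np.ndarray, runs: int = 100) -> dict:
    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    proc = psutil.Process()
    proc.cpu_percent(interval=None)  # prime the counter; next call covers the loop
    latencies, predictions = [], []

    for i in range(runs):
        x = samples[i % len(samples)].astype(inp["dtype"])
        interpreter.set_tensor(inp["index"], x[None, ...])  # add batch dimension
        start = time.perf_counter()
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1e3)  # milliseconds
        predictions.append(int(np.argmax(interpreter.get_tensor(out["index"]))))

    return {
        "latency_ms_mean": statistics.mean(latencies),
        "latency_ms_stdev": statistics.stdev(latencies),  # timing jitter
        "cpu_percent": proc.cpu_percent(interval=None),   # CPU use over the loop
        "rss_mb": proc.memory_info().rss / 2**20,         # resident memory
        # a crude proxy for prediction stability: share of the modal label
        "prediction_mode_share": max(predictions.count(p) for p in set(predictions)) / runs,
    }
```

On boards like these, reporting latency jitter alongside the mean matters, since background daemons and thermal throttling can pull tail latencies far from the average.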
Related papers
- DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs [18.46752801066992]
We introduce DABench-LLM, a benchmarking framework for evaluating large language models on dataflow-based accelerators. We validate DABench-LLM on three commodity dataflow accelerators: the Cerebras WSE-2, SambaNova RDU, and Graphcore IPU.
arXiv Detail & Related papers (2025-12-04T22:43:14Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception, but high computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe [68.04078852416248]
MiniCPM-V 4.5 is an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method. MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters.
arXiv Detail & Related papers (2025-09-16T19:41:48Z)
- Pushing the Envelope of LLM Inference on AI-PC [45.081663877447816]
Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts at the same model size, yet the computational efficiency of the state-of-the-art inference runtimes (e.g., bitnet) used to deploy them remains underexplored. We take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency. We present end-to-end inference results with 2-bit models that outperform the current SOTA runtime, bitnet.
arXiv Detail & Related papers (2025-08-08T23:33:38Z)
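The optimized microkernels from the AI-PC paper are not shown in its summary; as a minimal sketch of just the storage side of the technique, the code below packs sixteen 2-bit two's-complement weights into each 32-bit word and unpacks them again. It is a generic NumPy illustration under the assumption of a {-2, -1, 0, 1} value range, not the authors' CPU kernels.

```python
# Generic 2-bit weight packing/unpacking in NumPy; illustrative only, not the
# optimized 1-/2-bit CPU microkernels described in the paper.
import numpy as np

def pack_2bit(weights: np.ndarray) -> np.ndarray:
    """Pack integer weights in {-2..1} as 2-bit fields, sixteen per uint32."""
    assert weights.size % 16 == 0
    codes = (weights.astype(np.int64) & 0b11).reshape(-1, 16).astype(np.uint32)
    shifts = np.arange(16, dtype=np.uint32) * 2
    return (codes << shifts).sum(axis=1, dtype=np.uint64).astype(np.uint32)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    shifts = np.arange(16, dtype=np.uint32) * 2
    codes = (packed[:, None] >> shifts) & 0b11                        # isolate 2-bit fields
    signed = np.where(codes >= 2, codes.astype(np.int64) - 4, codes)  # sign-extend
    return signed.reshape(-1).astype(np.int8)

w = np.random.randint(-2, 2, size=64)                 # multiple of 16 weights
assert np.array_equal(unpack_2bit(pack_2bit(w)), w)   # round-trip check
```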
- Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling [0.02091806248191979]
We introduce LIFE, a lightweight and modular analytical framework composed of analytical models of individual operators. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention variants, and operator fusion. We validate LIFE's forecasts with inference on AMD CPUs, NPUs, and iGPUs and on NVIDIA V100 GPUs, using Llama2-7B variants.
arXiv Detail & Related papers (2025-07-29T03:08:31Z)
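LIFE's actual operator models are not reproduced in the summary; a common hardware-agnostic forecasting device, sketched below with invented hardware numbers, is a roofline bound that takes the larger of an operator's compute time and its memory-traffic time. The `Hardware` figures, the weight-traffic approximation, and the Llama2-7B-like layer shape are all assumptions for illustration.

```python
# Roofline-style latency forecast for a single matmul operator; a sketch of
# the general analytical-modeling idea, not LIFE's actual operator models.
from dataclasses import dataclass

@dataclass
class Hardware:
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s

def matmul_latency(m: int, k: int, n: int, bytes_per_weight: float, hw: Hardware) -> float:
    flops = 2.0 * m * k * n                 # one multiply-accumulate = 2 FLOPs
    traffic = k * n * bytes_per_weight      # weight reads dominate at batch size 1
    return max(flops / hw.peak_flops, traffic / hw.peak_bandwidth)

# Invented laptop-class numbers: 20 TFLOP/s compute, 100 GB/s memory bandwidth.
hw = Hardware(peak_flops=20e12, peak_bandwidth=100e9)
# A Llama2-7B-like 4096x4096 projection decoding one token (m = 1):
for name, bpw in [("fp16", 2.0), ("int4", 0.5)]:
    t = matmul_latency(1, 4096, 4096, bytes_per_weight=bpw, hw=hw)
    print(f"{name}: {t * 1e6:.0f} us per projection")
```

The example shows how such a model captures the effect of quantization: shrinking bytes per weight cuts the memory-bound decode latency proportionally.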
- MiniCPM4: Ultra-Efficient LLMs on End Devices [126.22958722174583]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
- Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC [8.837470787975308]
Large Language Models (LLMs) on edge devices offer significant privacy benefits, but they inherently face performance limitations due to reduced model capacity and necessary compression techniques. We introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs.
arXiv Detail & Related papers (2025-05-21T02:23:01Z)
- Can LLMs Revolutionize the Design of Explainable and Efficient TinyML Models? [8.953379216683736]
This paper introduces a novel framework for designing efficient neural network architectures specifically tailored to tiny machine learning (TinyML) platforms. By leveraging large language models (LLMs) for neural architecture search (NAS), a vision transformer (ViT)-based knowledge distillation (KD) strategy, and an explainability module, the approach strikes an optimal balance between accuracy, computational efficiency, and memory usage.
arXiv Detail & Related papers (2025-04-13T18:36:03Z)
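Of the ingredients named in that summary, knowledge distillation is the most self-contained; the sketch below is a generic Hinton-style KD loss that blends temperature-softened teacher targets with hard-label cross-entropy. The temperature and blending weight are illustrative defaults, not the paper's ViT-specific recipe.

```python
# Generic knowledge-distillation loss (soft teacher targets + hard labels);
# an illustration of the KD ingredient, not the paper's ViT-based strategy.
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.7):
    """alpha-weighted blend of soft-target cross-entropy and hard-label loss."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T  # T^2 rescales gradients
    probs = softmax(student_logits)
    hard = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 samples, 10 classes.
s, t = np.random.randn(8, 10), np.random.randn(8, 10)
y = np.random.randint(0, 10, size=8)
print(f"KD loss: {kd_loss(s, t, y):.3f}")
```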
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing, yet they are extremely computationally expensive, even at inference time. We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
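The four algorithms that this last paper benchmarks are not named in the summary; as a generic illustration of the incremental setting, the sketch below updates a tiny linear classifier one labeled sample at a time, the kind of constant-memory loop that fits a tiny-memory MCU. The feature count (a flattened accelerometer window) and class count are hypothetical.

```python
# A generic incremental (online) learning loop: one-sample SGD on a tiny
# softmax classifier. Illustrates the setting only; the paper's four
# benchmarked algorithms are not reproduced here.
import numpy as np

class OnlineSoftmax:
    """Multinomial logistic regression trained one sample at a time."""
    def __init__(self, n_features: int, n_classes: int, lr: float = 0.05):
        self.W = np.zeros((n_classes, n_features), dtype=np.float32)
        self.b = np.zeros(n_classes, dtype=np.float32)
        self.lr = lr

    def partial_fit(self, x: np.ndarray, y: int) -> None:
        logits = self.W @ x + self.b
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax probabilities
        p[y] -= 1.0                       # cross-entropy gradient w.r.t. logits
        self.W -= self.lr * np.outer(p, x)
        self.b -= self.lr * p

    def predict(self, x: np.ndarray) -> int:
        return int(np.argmax(self.W @ x + self.b))

# Hypothetical stream: 3-axis accelerometer windows of 20 samples (60 features),
# four gesture classes, learned sample-by-sample as data arrives.
model = OnlineSoftmax(n_features=60, n_classes=4)
for _ in range(100):
    x = np.random.randn(60).astype(np.float32)   # stand-in window
    y = int(np.random.randint(4))                # stand-in label
    model.partial_fit(x, y)
print(model.predict(np.random.randn(60).astype(np.float32)))
```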
This list is automatically generated from the titles and abstracts of the papers on this site.