An Evaluation of LLMs Inference on Popular Single-board Computers
- URL: http://arxiv.org/abs/2511.07425v1
- Date: Mon, 20 Oct 2025 01:35:45 GMT
- Title: An Evaluation of LLMs Inference on Popular Single-board Computers
- Authors: Tung Nguyen, Tuyen Nguyen
- Abstract summary: Single-board computers (SBCs) offer a promising platform for localized, privacy-preserving inference. We benchmark the performance of 25 quantized open-source large language models (LLMs) across three SBCs (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro) using two inference runtimes. Our results show that SBCs can reliably support models up to 1.5B parameters, with Llamafile achieving up to 4x higher throughput and 30-40% lower power usage than Ollama.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing demand for on-device large language model (LLM) inference is driving interest in deploying lightweight, cost-effective AI solutions on edge hardware. Single-board computers (SBCs) such as the Raspberry Pi and Orange Pi offer a promising platform for localized, privacy-preserving inference, but remain underexplored in the context of LLM workloads. In this work, we benchmark the performance of 25 quantized open-source LLMs across three SBCs (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro) using two inference runtimes: Ollama and Llamafile. We evaluate generation throughput, memory usage, and power consumption under varying CPU configurations, using multiple prompt types to simulate realistic workloads. Our results show that SBCs can reliably support models up to 1.5B parameters, with Llamafile achieving up to 4x higher throughput and 30-40% lower power usage than Ollama. We identify architecture-specific bottlenecks, highlight runtime-level trade-offs, and provide practical deployment recommendations. This study offers the first broad evaluation of LLM inference on SBCs, bridging the gap between high-performance language models and affordable edge computing.
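To make the measurement setup concrete, here is a minimal sketch of how per-model decode throughput can be collected against a locally running Ollama server. The endpoint is Ollama's default; the model tag and prompt are placeholder assumptions, not the paper's actual benchmark configuration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def benchmark(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return decode throughput (tokens/s)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_count / eval_duration (nanoseconds) are the generation-phase
    # counters Ollama reports in its response JSON.
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

if __name__ == "__main__":
    # Model tag and prompt are placeholders; use any quantized model that fits in RAM.
    tps = benchmark("qwen2:1.5b", "Explain what a single-board computer is.")
    print(f"decode throughput: {tps:.2f} tokens/s")
```

Llamafile can serve an OpenAI-compatible HTTP endpoint instead, so a similar harness there would time the streamed tokens directly.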
Related papers
- Real-Time Performance Benchmarking of TinyML Models in Embedded Systems (PICO: Performance of Inference, CPU, and Operations) [5.637804042390397]
PICO-TINYML-BENCHMARK is a framework for benchmarking the real-time performance of TinyML models on resource-constrained embedded systems. We benchmark three representative TinyML models on two widely adopted platforms, BeagleBone AI64 and Raspberry Pi 4. Results reveal critical trade-offs: the BeagleBone AI64 demonstrates consistent inference latency for AI-specific tasks, while the Raspberry Pi 4 excels in resource efficiency and cost-effectiveness.
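As a rough illustration of this kind of embedded benchmarking (not the PICO harness itself), the sketch below times single-inference latency for a TFLite model; the model path is a placeholder.

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite  # pip install tflite-runtime

def measure_latency(model_path: str, runs: int = 100, warmup: int = 10) -> float:
    """Average single-inference latency (ms) for a TFLite model with random input."""
    interp = tflite.Interpreter(model_path=model_path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    data = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
    for _ in range(warmup):                       # warm caches before timing
        interp.set_tensor(inp["index"], data)
        interp.invoke()
    start = time.perf_counter()
    for _ in range(runs):
        interp.set_tensor(inp["index"], data)
        interp.invoke()
    return (time.perf_counter() - start) / runs * 1e3

print(f"{measure_latency('model.tflite'):.2f} ms/inference")  # path is a placeholder
```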
arXiv Detail & Related papers (2025-09-05T00:30:39Z)
- Pushing the Envelope of LLM Inference on AI-PC [45.081663877447816]
Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts at the same model size, yet the computational efficiency of the state-of-the-art inference runtimes (e.g., bitnet) used to deploy them remains underexplored. We take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency. We present end-to-end inference results with 2-bit models that outperform the current SOTA runtime, bitnet.
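The storage-side idea behind such microkernels can be illustrated with a plain NumPy sketch of 2-bit weight packing, four codes per byte; a real kernel would fuse this unpacking with the GEMM rather than materialize the codes.

```python
import numpy as np

def pack_2bit(w: np.ndarray) -> np.ndarray:
    """Pack weights quantized to codes {0,1,2,3} (2 bits each), four per byte."""
    assert w.size % 4 == 0 and w.max() < 4
    w = w.astype(np.uint8).reshape(-1, 4)
    return (w[:, 0] | (w[:, 1] << 2) | (w[:, 2] << 4) | (w[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_2bit: recover four 2-bit codes from each byte."""
    return np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).ravel()

codes = np.random.randint(0, 4, size=32)
assert np.array_equal(codes, unpack_2bit(pack_2bit(codes)))  # lossless round trip
```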
arXiv Detail & Related papers (2025-08-08T23:33:38Z)
- BitNet b1.58 2B4T Technical Report [118.78752947128682]
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability.
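For intuition, here is a small NumPy sketch of the absmean ternary quantization scheme described in the BitNet b1.58 line of work; it is illustrative only, not the model's training-time implementation.

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} with absmean scaling."""
    gamma = np.abs(w).mean() + 1e-8          # per-tensor absmean scale
    q = np.clip(np.round(w / gamma), -1, 1)  # ternary codes
    return q.astype(np.int8), gamma          # dequantize later as q * gamma

w = np.random.randn(4, 8).astype(np.float32)
q, g = absmean_ternary(w)
print(q)
print("reconstruction MSE:", float(((q * g - w) ** 2).mean()))
```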
arXiv Detail & Related papers (2025-04-16T17:51:43Z)
- Generative AI on the Edge: Architecture and Performance Evaluation [0.3999851878220877]
6G's AI-native vision of embedding advanced intelligence in the network requires a systematic evaluation of Generative AI (GenAI) models on edge devices. This research investigates computationally demanding Large Language Model (LLM) inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN.
arXiv Detail & Related papers (2024-11-18T16:09:01Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
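The general overlap idea (keeping compute busy while the next weights are in flight) can be sketched with a bounded queue and a prefetch thread; this is a toy illustration of pipelined weight streaming, not CGOPipe itself.

```python
import queue
import threading
import time

def prefetch(layers, q):
    """I/O stage: load (here, simulate loading) each layer's weights ahead of compute."""
    for layer in layers:
        time.sleep(0.01)            # stand-in for disk/CPU-to-GPU weight transfer
        q.put(layer)
    q.put(None)                     # sentinel: no more layers

def run_pipeline(layers, depth=2):
    """Overlap weight loading with compute using a bounded queue (double buffering)."""
    q = queue.Queue(maxsize=depth)  # bounded: prefetch stays `depth` layers ahead
    threading.Thread(target=prefetch, args=(layers, q), daemon=True).start()
    while (layer := q.get()) is not None:
        time.sleep(0.01)            # stand-in for the GEMMs of this layer
        print(f"computed {layer}")

run_pipeline([f"expert_{i}" for i in range(6)])
```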
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN).
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
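A reference (unfused, NumPy) version of the mixed-precision GEMM such kernels accelerate looks like the sketch below: group-wise INT4 weights are dequantized and multiplied against full-precision activations, whereas a real kernel like MARLIN fuses these steps with asynchronous memory movement.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group: int = 128):
    """Symmetric 4-bit quantization with one scale per `group` weights."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def matmul_w4a16(x, q, scale, shape):
    """Reference mixed-precision matmul: dequantize INT4 weights, then GEMM."""
    w = (q.astype(np.float32) * scale).reshape(shape)
    return x @ w

x = np.random.randn(16, 256).astype(np.float32)          # batch of activations
w = np.random.randn(256, 512).astype(np.float32)
q, s = quantize_int4_groupwise(w)
err = np.abs(matmul_w4a16(x, q, s, w.shape) - x @ w).mean()
print("mean abs error from 4-bit weights:", err)
```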
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving [8.706905652975554]
This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes prompt sharing.
We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism.
Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms SOTA serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
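To convey the scheduling intuition, here is a toy router that trades shared-prefix length against per-worker load; it is an illustrative heuristic in the spirit of KV-reuse-aware scheduling, not Preble's actual algorithm.

```python
import os

class PrefixAffinityScheduler:
    """Toy router: send each request to the worker with the longest cached
    shared prompt prefix, penalized by that worker's current load."""

    def __init__(self, n_workers: int, load_weight: float = 8.0):
        self.cached = [[] for _ in range(n_workers)]   # prompts each worker has seen
        self.load = [0] * n_workers                    # outstanding requests
        self.load_weight = load_weight

    def _shared(self, worker: int, prompt: str) -> int:
        return max((len(os.path.commonprefix([p, prompt]))
                    for p in self.cached[worker]), default=0)

    def route(self, prompt: str) -> int:
        scores = [self._shared(w, prompt) - self.load_weight * self.load[w]
                  for w in range(len(self.load))]
        best = max(range(len(scores)), key=scores.__getitem__)
        self.cached[best].append(prompt)
        self.load[best] += 1
        return best

sched = PrefixAffinityScheduler(n_workers=2)
print(sched.route("You are a helpful assistant. Summarize:"))  # tie -> worker 0
print(sched.route("You are a helpful assistant. Translate:"))  # shared prefix outweighs load -> worker 0
```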
arXiv Detail & Related papers (2024-05-08T06:30:58Z)
- Optimizing LLM Queries in Relational Data Analytics Workloads [50.95919232839785]
Batch data analytics is a growing application for Large Language Models (LLMs): they enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. We propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads.
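One simple cost-reduction idea in this vein (illustrative, not necessarily the paper's technique) is to deduplicate identical column values so each distinct input triggers only one model call:

```python
def classify_column(rows, llm_call):
    """Apply an LLM task to a column while invoking the model only once
    per distinct value; repeated values are served from a cache."""
    cache = {}
    out = []
    for value in rows:
        if value not in cache:              # only distinct values hit the model
            cache[value] = llm_call(value)
        out.append(cache[value])
    return out

# Stub standing in for a real LLM endpoint; real calls are the expensive part.
fake_llm = lambda v: "positive" if "good" in v else "negative"
reviews = ["good product", "bad service", "good product", "good product"]
print(classify_column(reviews, fake_llm))   # 4 rows, only 2 model calls
```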
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
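A simplified stand-in for such load balancing is to split the model's layers across devices in proportion to their measured speed, so no pipeline stage bottlenecks the rest; the device names and speeds below are invented for illustration.

```python
def assign_layers(n_layers: int, device_speeds: dict[str, float]) -> dict[str, int]:
    """Split a model's layers across devices proportionally to measured
    throughput, a simple stand-in for the load-balancing protocols such
    distributed-inference systems implement."""
    total = sum(device_speeds.values())
    shares = {d: n_layers * s / total for d, s in device_speeds.items()}
    plan = {d: int(share) for d, share in shares.items()}
    # Hand out layers lost to rounding, largest fractional remainder first.
    for d in sorted(shares, key=lambda d: shares[d] - plan[d], reverse=True):
        if sum(plan.values()) == n_layers:
            break
        plan[d] += 1
    return plan

# Speeds are illustrative measurements, not real benchmark data.
print(assign_layers(32, {"laptop": 5.0, "desktop": 20.0, "server": 55.0}))
```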
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to their astronomical number of parameters.
We propose an effective approach that makes the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
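Model FLOP utilization (MFU) itself is straightforward to compute once throughput is measured; the sketch below uses the common ~6N FLOPs-per-token rule of thumb for training an N-parameter transformer, with purely illustrative numbers.

```python
def model_flop_utilization(tokens_per_s: float, n_params: float, peak_flops: float) -> float:
    """MFU = achieved FLOP/s over peak FLOP/s. For training an N-parameter
    transformer, a common approximation is ~6*N FLOPs per processed token."""
    achieved = 6.0 * n_params * tokens_per_s
    return achieved / peak_flops

# Illustrative numbers only: fine-tuning a 1.5B-parameter model at 2 tokens/s
# on hardware with a nominal 100 GFLOP/s of sustained compute.
print(f"MFU: {model_flop_utilization(2, 1.5e9, 100e9):.1%}")
```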
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.