Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
- URL: http://arxiv.org/abs/2602.22812v1
- Date: Thu, 26 Feb 2026 09:53:17 GMT
- Title: Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
- Authors: Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura
- Abstract summary: Local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck. This paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices.
- Score: 1.0832844764942349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.
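Of the mechanisms described in the abstract, the catalog lends itself most directly to a sketch. Below is a minimal Python illustration, under stated assumptions: each device summarizes the prompt prefixes whose cached states it holds in a Bloom filter, peers probe that filter locally before opening a wireless connection, and partial matching is approximated by testing progressively shorter token prefixes. The class and function names, filter parameters, and prefix-keying scheme are illustrative choices, not details taken from the paper.

```python
import hashlib

class BloomCatalog:
    """Compact summary of the prompt prefixes whose cached states a peer holds.

    A peer broadcasts this small bit array; other devices probe it locally,
    so definite misses cost no wireless round trip. False positives are
    possible (a wasted request), false negatives are not, so a negative
    answer always safely skips the communication.
    """

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(key))


def longest_remote_prefix(tokens: list[int], catalog: BloomCatalog,
                          min_len: int = 8) -> int:
    """Partial matching: longest token prefix a remote peer may have cached."""
    for end in range(len(tokens), min_len - 1, -1):
        if catalog.might_contain(",".join(map(str, tokens[:end]))):
            return end  # fetch the cached states for tokens[:end] from the peer
    return 0  # no plausible remote hit; prefill the whole prompt locally
```

Under such a scheme a device would fetch only the matched prefix's states and prefill the remaining suffix locally, which is where a large TTFT reduction comes from. Probing every prefix length, as done here, is the naive strategy; keying prefixes at fixed chunk boundaries would cut the number of catalog probes.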
Related papers
- Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA [61.12136997430116]
Decentralized federated learning (DFL) based on low-rank adaptation (LoRA) enables mobile devices with multi-task datasets to collaboratively fine-tune a large language model (LLM) by exchanging locally updated parameters with a subset of neighboring devices via wireless connections for knowledge integration. Directly aggregating parameters fine-tuned on heterogeneous datasets induces three primary issues across the DFL life-cycle: (i) catastrophic knowledge forgetting during the fine-tuning process, arising from conflicting update directions caused by data heterogeneity; (ii) inefficient communication and convergence during the model aggregation process; ...
arXiv Detail & Related papers (2026-02-24T02:45:32Z)
- HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z)
- Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference [17.27010833526918]
Staggered Batch Scheduling (SBS) buffers requests to form optimal execution batches. A Load-Aware Global Allocation strategy balances computational load across DP units for both the Prefill and Decode phases. Our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
arXiv Detail & Related papers (2025-12-18T03:45:05Z)
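The buffering idea in the SBS summary above reduces to a small scheduling primitive. The sketch below is an illustrative Python stand-in, not the paper's algorithm: requests are held briefly to form a fuller batch, a size cap bounds the batch, and a wait deadline caps the added time-to-first-token; `max_batch` and `max_wait_s` are assumed tuning knobs.

```python
import time
from queue import Queue, Empty

def form_batch(requests: Queue, max_batch: int = 8, max_wait_s: float = 0.02) -> list:
    """Hold arriving requests briefly to form a fuller execution batch.

    Illustrative stand-in for staggered batching: waiting improves batch
    efficiency (throughput), while the deadline bounds the extra
    time-to-first-token any buffered request can accumulate.
    """
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch
```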
- Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression [6.932768187544348]
We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses intermediate features (IFs) to reduce both communication and server load in split computing. Across standard vision and LLM workloads, SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x.
arXiv Detail & Related papers (2025-11-03T08:44:13Z)
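Training-free feature compression at a split point is straightforward to illustrate. The sketch below uses generic 8-bit uniform quantization plus zlib as a stand-in codec, not SLICER's actual method; the point it demonstrates is that the transform touches only the transmitted tensor, never the model weights, so no retraining is needed on either side of the split.

```python
import zlib
import numpy as np

def compress_features(feats: np.ndarray) -> tuple[bytes, tuple]:
    """Quantize-then-entropy-code intermediate features before uplink."""
    lo, hi = float(feats.min()), float(feats.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant tensors
    q = np.round((feats - lo) / scale).astype(np.uint8)
    return zlib.compress(q.tobytes()), (lo, scale, feats.shape)

def decompress_features(blob: bytes, meta: tuple) -> np.ndarray:
    """Server-side reconstruction (lossy: error within one quantization step)."""
    lo, scale, shape = meta
    q = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
    return q.astype(np.float32) * scale + lo
```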
- Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features. Results show that TOFC achieves up to a 52% reduction in data transmission overhead and a 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
- Adaptive Federated Pruning in Hierarchical Wireless Networks [69.6417645730093]
Federated Learning (FL) is a privacy-preserving distributed learning framework in which a server aggregates models updated by multiple devices without accessing their private datasets.
In this paper, we introduce model pruning for hierarchical federated learning (HFL) in wireless networks to reduce the neural network scale.
We show that our proposed HFL with model pruning achieves learning accuracy similar to HFL without pruning while reducing communication cost by about 50 percent.
arXiv Detail & Related papers (2023-05-15T22:04:49Z)
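The communication saving that pruning buys can be made concrete. The function below uses plain magnitude pruning, a deliberately simple stand-in for the paper's adaptive scheme: weights that are zeroed need not be transmitted, so a sparse encoding of the survivors shrinks the uplink payload roughly in proportion to the sparsity level.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights before transmission.

    Simple stand-in for adaptive federated pruning: at 50% sparsity, a
    sparse encoding of the surviving weights carries roughly half the
    payload, mirroring the communication saving the summary reports.
    """
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0  # ties may prune a few extra
    return pruned
```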
- Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe; experimental results show that, compared to the state-of-the-art solution vLLM, FastServe improves throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
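Token-granularity preemption, the mechanism the FastServe summary highlights, can be sketched as a small serving loop: the scheduler re-selects a job before every output token, so a newly arrived short request never waits behind a long generation. The demote-on-run heap policy below is only an illustrative stand-in, not FastServe's actual queueing discipline.

```python
import heapq

def serve_preemptively(jobs: list, step_fn) -> None:
    """Iteration-level serving: one scheduling decision per output token.

    `step_fn(job)` generates a single token and returns True while the job
    still has tokens to emit. Re-picking the job on every iteration is what
    enables preemption at token granularity; demoting a job each time it
    runs makes short outputs finish first (a least-attained-service policy).
    """
    heap = [(0, i, job) for i, job in enumerate(jobs)]  # (service, tiebreak, job)
    heapq.heapify(heap)
    tiebreak = len(heap)
    while heap:
        served, _, job = heapq.heappop(heap)  # preemption point, once per token
        if step_fn(job):
            heapq.heappush(heap, (served + 1, tiebreak, job))
            tiebreak += 1
```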
- Efficient Multi-stage Inference on Tabular Data [1.6371451481715193]
Conventional wisdom favors segregating ML code into services queried by product code via RPC APIs.
We simplify inference algorithms and embed them into the product code to reduce network communication.
By applying our optimization with AutoML to both training and inference, we reduce inference latency by 1.3x, CPU resources by 30%, and network communication between application front-end and ML back-end by about 50%.
arXiv Detail & Related papers (2023-03-21T04:01:55Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT)-based intelligent and ubiquitous computing.
Given fast-growing applications and data volumes, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data at a central location.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition [33.581285906182075]
We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
LCP models achieve maximum and average speedups of 56x and 7x, respectively, compared to the original models, and the average speedup can be raised to up to 33x with further improvements.
arXiv Detail & Related papers (2020-03-13T19:52:44Z)
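The branch structure LCP describes maps onto a two-exchange execution pattern, sketched below with plain callables standing in for per-device sub-networks: the input is broadcast once and the branch outputs are gathered once, so no per-layer activations cross the network. The fusion by concatenation is an assumption for illustration, not a detail from the paper.

```python
import numpy as np

def branch_parallel_forward(x: np.ndarray, branches: list) -> np.ndarray:
    """Run almost-independent narrow branches on the same input.

    Illustrative sketch of the LCP execution pattern: each callable in
    `branches` stands in for one device's narrow sub-network. Communication
    happens only twice per inference, broadcasting `x` and gathering the
    branch outputs, which are fused here by concatenation.
    """
    outputs = [branch(x) for branch in branches]  # would run on separate devices
    return np.concatenate(outputs, axis=-1)
```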