Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
- URL: http://arxiv.org/abs/2510.16040v1
- Date: Thu, 16 Oct 2025 07:12:08 GMT
- Title: Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
- Authors: Tianhua Xia, Sai Qian Zhang
- Abstract summary: Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. However, implementing LLMs on edge devices presents challenges, particularly with managing key-value (KV) caches. We propose embedded DRAM (eDRAM) as the primary storage for LLM serving on edge devices, which offers higher storage density compared to SRAM.
- Score: 9.984481065465028
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does not need to be sent to the cloud, ensuring faster responses and reducing reliance on network connectivity. However, implementing LLMs on edge devices presents challenges, particularly with managing key-value (KV) caches, which play a pivotal role in LLM serving. As the input text lengthens, the size of the KV cache increases linearly with the sequence length, leading to a significant memory footprint and high data access costs. At the same time, edge devices have limited memory and computational power, making it hard to store and efficiently access the large caches needed for LLM inference. To mitigate the substantial overhead caused by the KV cache, we propose using embedded DRAM (eDRAM) as the primary storage for LLM serving on edge devices, which offers higher storage density compared to SRAM. However, to ensure data integrity, eDRAM needs periodic refresh operations, which are power-intensive. To reduce eDRAM costs and improve overall system performance, we propose Kelle, a software-hardware co-design solution optimized for deploying LLMs on eDRAM-based edge systems. Combined with our fine-grained memory eviction, recomputation, and refresh control algorithms, the Kelle accelerator delivers a 3.9x speedup and 4.5x energy savings compared to existing baseline solutions.
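To make the linear growth concrete, here is a back-of-the-envelope KV cache sizing in Python. The model dimensions below are illustrative assumptions, not the configuration evaluated in the paper:

```python
# Hypothetical model config; the per-token KV footprint is constant, so the
# total size scales linearly with sequence length.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Each layer caches one key and one value vector per KV head:
    # 2 * n_kv_heads * head_dim elements per token per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token

for seq_len in (1024, 8192, 32768):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 2**20:8.1f} MiB")
```

With these numbers the cache already reaches 4 GiB at 32K tokens, far beyond the on-chip SRAM budget of typical edge accelerators, which is what motivates denser eDRAM plus fine-grained eviction and recomputation.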
Related papers
- Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing [2.9665163298601342]
Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations. Existing in/near-memory solutions face critical limitations such as reduced memory capacity. This work presents a chiplet-based memory module that addresses these limitations.
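The memory-bound claim is easy to see at the single-request level: each decode step attends one query vector over all cached keys, which is exactly a GEMV. A minimal numpy sketch with illustrative shapes:

```python
import numpy as np

# One decode step of single-head attention. Every cached key/value row is
# read from memory to contribute only a handful of FLOPs, so bandwidth,
# not compute, is the bottleneck.
seq_len, head_dim = 4096, 128
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((seq_len, head_dim), dtype=np.float32)
v_cache = rng.standard_normal((seq_len, head_dim), dtype=np.float32)
q = rng.standard_normal(head_dim, dtype=np.float32)  # the new token's query

scores = k_cache @ q / np.sqrt(head_dim)  # GEMV: (S, D) @ (D,) -> (S,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # softmax over cached positions
out = probs @ v_cache                     # second GEMV: (S,) @ (S, D) -> (D,)
```

Batching many requests turns these GEMVs into the "flat" (skinny) GEMMs mentioned above, which remain memory-bound.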
arXiv Detail & Related papers (2025-11-15T16:39:51Z)
- LightMem: Lightweight and Efficient Memory-Augmented Generation [72.21680105265824]
We introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x.
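As a rough sketch of what a staged, Atkinson-Shiffrin-style memory pipeline can look like; the stage names, thresholds, and promotion rules below are invented for illustration and are not LightMem's actual design:

```python
from collections import deque

class ThreeStageMemory:
    """Toy sensory -> short-term -> long-term pipeline (illustrative only)."""

    def __init__(self, sensory_size=8, short_term_size=32):
        self.sensory = deque(maxlen=sensory_size)        # raw recent inputs
        self.short_term = deque(maxlen=short_term_size)  # filtered working set
        self.long_term = []                              # consolidated entries

    def observe(self, item, salience):
        self.sensory.append((item, salience))
        if salience > 0.5:                 # promote salient items only
            self.short_term.append((item, salience))

    def consolidate(self):
        # E.g. at the end of a turn, keep only the most salient entries.
        self.long_term += [i for i, s in self.short_term if s > 0.8]
        self.short_term.clear()

mem = ThreeStageMemory()
mem.observe("user prefers concise answers", salience=0.9)
mem.observe("greeting small talk", salience=0.2)
mem.consolidate()
print(mem.long_term)  # ['user prefers concise answers']
```

The staging is what buys efficiency: most raw context never reaches the consolidated store, so downstream retrieval touches far fewer tokens.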
arXiv Detail & Related papers (2025-10-21T17:58:17Z)
- DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones [10.813495376006427]
Large language models (LLMs) are increasingly expected to support efficient and effective long-sequence decoding. Due to limited DRAM capacity, long-sequence LLM decoding on smartphones is constrained by the key-value cache (KVCache). We propose DynaKV, the first adaptive KVCache management approach that jointly addresses accuracy and efficiency for long-sequence decoding on smartphones.
arXiv Detail & Related papers (2025-10-20T08:56:02Z)
- Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System [20.652641518700346]
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth. Modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints.
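A toy static version of the placement idea: to maximize aggregated bandwidth, read traffic should be split across tiers roughly in proportion to their bandwidths, subject to capacity. The bandwidth ratios and capacities below are illustrative assumptions, and the paper's policy is dynamic rather than this one-shot heuristic:

```python
# Place hot KV blocks so that both HBM and off-package DRAM stay busy.
def place_kv(blocks_hot_first, hbm_capacity, bw_hbm=3.0, bw_dram=1.0):
    """blocks_hot_first: block ids sorted hottest-first.

    Serving reads in proportion to tier bandwidth keeps both memories
    streaming; with roughly uniform block hotness, that split can be
    approximated by the fraction of blocks placed in each tier.
    """
    hbm_share = bw_hbm / (bw_hbm + bw_dram)  # 0.75 with these defaults
    n_hbm = min(hbm_capacity, int(len(blocks_hot_first) * hbm_share))
    placement = {b: "HBM" for b in blocks_hot_first[:n_hbm]}
    placement.update({b: "DRAM" for b in blocks_hot_first[n_hbm:]})
    return placement

print(place_kv([f"blk{i}" for i in range(8)], hbm_capacity=16))
# {'blk0': 'HBM', ..., 'blk5': 'HBM', 'blk6': 'DRAM', 'blk7': 'DRAM'}
```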
arXiv Detail & Related papers (2025-08-17T19:07:08Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
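As a generic illustration of attention-guided cache eviction (a simple heuristic, not Sparse-dLLM's delayed bidirectional scheme), one can drop the cached tokens that recent queries attend to least:

```python
import numpy as np

def evict_low_attention(k_cache, v_cache, attn_probs, keep):
    """Keep the `keep` cached tokens with the highest recent attention mass.

    k_cache, v_cache: (seq_len, head_dim)
    attn_probs: (n_recent_queries, seq_len) attention weights
    """
    scores = attn_probs.mean(axis=0)                # avg attention per token
    keep_idx = np.sort(np.argsort(scores)[-keep:])  # top-k, original order
    return k_cache[keep_idx], v_cache[keep_idx], keep_idx

rng = np.random.default_rng(0)
k, v = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
attn = rng.random((3, 6)); attn /= attn.sum(axis=1, keepdims=True)
k2, v2, kept = evict_low_attention(k, v, attn, keep=4)  # cache shrinks to 4 rows
```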
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models. LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z)
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI.
The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache.
Several cost-effective solutions leverage host memory or storage to reduce costs for offline inference scenarios.
We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in the decoding phase) and data (i.e., the KV cache) to Computational Storage Drives (CSDs).
InstInfer significantly improves throughput for long-sequence inference.
arXiv Detail & Related papers (2024-09-08T06:06:44Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
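A minimal sketch of pruning along the channel dimension in this spirit; the importance score below (query-weighted channel magnitude) is a simple stand-in, not necessarily ThinK's exact criterion:

```python
import numpy as np

def prune_key_channels(k_cache, recent_queries, keep_ratio=0.6):
    """Drop key-cache channels that contribute least to recent q.k scores.

    k_cache: (seq_len, head_dim); recent_queries: (n_queries, head_dim).
    Returns the thinner key cache and the indices of kept channels (the
    same indices must be used to slice future queries).
    """
    importance = np.abs(recent_queries).mean(0) * np.abs(k_cache).mean(0)
    keep = max(1, int(k_cache.shape[1] * keep_ratio))
    kept = np.sort(np.argsort(importance)[-keep:])
    return k_cache[:, kept], kept

rng = np.random.default_rng(0)
k_thin, kept = prune_key_channels(rng.standard_normal((1024, 128)),
                                  rng.standard_normal((16, 128)))
print(k_thin.shape)  # (1024, 76): memory drops with the channel count
```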
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
However, handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
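The enabling observation is that attention over a shared prefix can be computed once for the whole batch (one GEMM instead of per-request GEMVs over the same system-prompt KV) and then merged with each request's own attention using the softmax log-sum-exp identity. A numpy sketch of the merge:

```python
import numpy as np

def attn_with_lse(q, k, v):
    """Single-query attention; also returns the log-sum-exp of the scores."""
    s = k @ q / np.sqrt(q.shape[-1])
    lse = np.log(np.exp(s).sum())
    return np.exp(s - lse) @ v, lse

def relay_merge(out_prefix, lse_prefix, out_suffix, lse_suffix):
    # Softmax over the concatenated sequence is a weighted blend of the two
    # partial results, with weights recovered from their log-sum-exp terms.
    w = 1.0 / (1.0 + np.exp(lse_suffix - lse_prefix))
    return w * out_prefix + (1.0 - w) * out_suffix

# Sanity check against attention over the full prefix+suffix sequence.
rng = np.random.default_rng(0)
d, n_p, n_s = 64, 32, 16
q = rng.standard_normal(d)
k_p, v_p = rng.standard_normal((n_p, d)), rng.standard_normal((n_p, d))
k_s, v_s = rng.standard_normal((n_s, d)), rng.standard_normal((n_s, d))
o_p, l_p = attn_with_lse(q, k_p, v_p)
o_s, l_s = attn_with_lse(q, k_s, v_s)
full, _ = attn_with_lse(q, np.vstack([k_p, k_s]), np.vstack([v_p, v_s]))
assert np.allclose(relay_merge(o_p, l_p, o_s, l_s), full)
```

Because the prefix pass is shared, the system-prompt KV is read from DRAM once per batch rather than once per request.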
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences arising from its use.