ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators
- URL: http://arxiv.org/abs/2512.09427v1
- Date: Wed, 10 Dec 2025 08:52:20 GMT
- Title: ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators
- Authors: Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han,
- Abstract summary: Large language models (LLMs) on accelerators with poor random-access bandwidth are limited by current memory managers.<n>We present ODMA, an on-demand memory allocation framework for RACM.<n>ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard.
- Score: 14.238528502723787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Serving large language models (LLMs) on accelerators with poor random-access bandwidth (e.g., LPDDR5-based) is limited by current memory managers. Static pre-allocation wastes memory, while fine-grained paging (e.g., PagedAttention) is ill-suited due to high random-access costs. Existing HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators like Cambricon MLU370. We present ODMA, an on-demand memory allocation framework for RACM. ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard. Boundaries are periodically updated from live traces to maximize utilization. On Alpaca and Google-NQ, ODMA improves prediction accuracy of prior work significantly (e.g., from 82.68% to 93.36%). Serving DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, ODMA raises memory utilization from 55.05% to 72.45% and improves RPS and TPS by 29% and 27% over static baselines. This demonstrates that hardware-aware allocation unlocks efficient LLM serving on RACM platforms.
Related papers
- Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64 [0.5729426778193398]
"Virtual Core" architecture implemented in software optimized for ARM64 microarchitectures (Apple Silicon)<n>"Software-Defined Direct Memory Access (DMA)" guarantees 100% cache line utilization for weight, while our zero-copy loader eliminates latency.<n> Experimental results on a 110M-second model demonstrate a stable throughput of >60 tokens/second on M2 hardware.
arXiv Detail & Related papers (2026-01-06T15:00:40Z) - SafeLoad: Efficient Admission Control Framework for Identifying Memory-Overloading Queries in Cloud Data Warehouses [59.68732483257323]
Memory overload is a common form of resource exhaustion in cloud data warehouses.<n>We propose SafeLoad, the first query admission control framework specifically designed to identify memory-overloading (MO) queries.<n>We show that SafeLoad achieves state-of-the-art prediction performance with low online and offline time overhead.
arXiv Detail & Related papers (2026-01-05T08:29:51Z) - Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents [57.38404718635204]
Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows.<n>Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components.<n>We propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy.
arXiv Detail & Related papers (2026-01-05T08:24:16Z) - Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing [2.9665163298601342]
Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations.<n>Existing in/near-memory solutions face critical limitations such as reduced memory capacity.<n>This work presents a chiplet-based memory module that addresses these limitations.
arXiv Detail & Related papers (2025-11-15T16:39:51Z) - DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones [10.813495376006427]
Large language models (LLMs) are increasingly expected to support efficient and effective long-sequence decoding.<n>Due to limited DRAM capacity, long-seuqence LLM decoding on smartphones is constrained by the key-value cache ( KVCache)<n>We propose DynaKV, the first adaptive KVCache management approach that jointly addresses accuracy and efficiency for long-sequence decoding on smartphones.
arXiv Detail & Related papers (2025-10-20T08:56:02Z) - Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System [20.652641518700346]
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth.<n>Modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM.<n>This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints.
arXiv Detail & Related papers (2025-08-17T19:07:08Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding.<n>Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage.<n>We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents [79.87304940020256]
Large Language Models (LLMs) have been widely adopted in conversational agents.<n>MemGAS is a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval.<n> Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks.
arXiv Detail & Related papers (2025-05-26T06:13:07Z) - MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models [72.61076288351201]
We propose Memory-efficient Offloaded Mini-sequence Inference (MOM)<n>MOM partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading.<n>On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU.
arXiv Detail & Related papers (2025-04-16T23:15:09Z) - A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings.<n>Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features.<n> Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z) - COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators [6.172271429579593]
We propose a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators.<n>We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip.
arXiv Detail & Related papers (2025-01-12T11:31:25Z) - LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.