Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
- URL: http://arxiv.org/abs/2601.03324v1
- Date: Tue, 06 Jan 2026 15:00:40 GMT
- Title: Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
- Authors: Bugra Kilictas, Faruk Alpay
- Abstract summary: A "Virtual Tensor Core" architecture implemented in software, optimized for ARM64 microarchitectures (Apple Silicon). "Software-Defined Direct Memory Access (DMA)" guarantees 100% cache line utilization for weight matrices, while a zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware.
- Score: 0.5729426778193398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall": the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
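The zero-copy loading idea from the abstract can be illustrated in a short, self-contained sketch: weights are serialized to a file padded to a cache-line multiple, then mapped read-only with mmap so the OS pages them in on demand instead of copying them at startup. This is a minimal Python illustration of the general technique, not the paper's C/ARM64 implementation; the 64-byte cache-line constant and the file layout are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

CACHE_LINE = 64  # assumed cache-line size for padding; not taken from the paper

def write_weights(path, values):
    """Serialize float32 weights, padding the file to a cache-line multiple."""
    raw = struct.pack(f"<{len(values)}f", *values)
    pad = (-len(raw)) % CACHE_LINE
    with open(path, "wb") as f:
        f.write(raw + b"\x00" * pad)

def map_weights(path, count):
    """mmap the weight file read-only: no bytes are copied at load time;
    the OS faults pages in lazily as the view is first touched."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A memoryview over the mapping reads floats without copying into Python objects.
    return memoryview(mm)[: count * 4].cast("f")

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "weights.bin")
write_weights(path, [0.5, -1.25, 3.0, 0.0])
view = map_weights(path, 4)
loaded = list(view)                  # the mapped weights, read lazily
padded_size = os.path.getsize(path)  # a multiple of CACHE_LINE
```

The padding keeps every weight row starting on a cache-line boundary, which is the precondition for the "100% cache line utilization" claim; the real implementation would additionally align rows within the tensor layout.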
Related papers
- HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures [20.525243835887558]
SuperNode represents data movement using cache operators within the compiler. We implement SuperNode within the production deep learning framework MindSpore. We show that SuperNode reduces peak device memory usage by up to 26% for inference while maintaining end-to-end performance.
arXiv Detail & Related papers (2026-01-31T14:29:13Z)
- QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design [8.787715061109163]
Outlier-aware Quantization Memory Co-design (QMC) is a retraining-free quantization method with a novel heterogeneous memory architecture. QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x compared to FP16.
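The generic outlier-separation idea behind summaries like QMC's can be sketched as follows: quantize the bulk of the weights to int8 with a shared scale, but keep the rare large-magnitude values in full precision so they do not blow up the quantization range. This is a hypothetical illustration of the general technique only; the threshold, names, and formats are invented here, not QMC's co-designed scheme.

```python
def quantize_outlier_aware(values, threshold=6.0):
    """Split weights into a dense int8 body plus a sparse FP32 outlier map.
    Illustrative sketch; QMC's actual scheme is hardware co-designed."""
    inliers = [v for v in values if abs(v) <= threshold]
    # Scale so the largest inlier maps to 127; guard against an all-zero body.
    scale = max((abs(v) for v in inliers), default=1.0) / 127.0 or 1.0
    q = []
    outliers = {}  # index -> original FP32 value, stored separately
    for i, v in enumerate(values):
        if abs(v) > threshold:
            outliers[i] = v
            q.append(0)  # placeholder in the dense body
        else:
            q.append(max(-127, min(127, round(v / scale))))
    return q, scale, outliers

def dequantize(q, scale, outliers):
    """Reconstruct: outliers come back exactly, inliers up to quantization error."""
    return [outliers.get(i, x * scale) for i, x in enumerate(q)]

q, scale, outliers = quantize_outlier_aware([0.1, -0.5, 8.0, 0.3])
restored = dequantize(q, scale, outliers)
```

Separating the outliers keeps the int8 scale tight around the dense distribution, which is why such schemes lose little accuracy without retraining.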
arXiv Detail & Related papers (2026-01-21T00:11:34Z)
- ODMA: On-Demand Memory Allocation Framework for LLM Serving on LPDDR-Class Accelerators [14.238528502723787]
Large language models (LLMs) on accelerators with poor random-access bandwidth are limited by current memory managers. We present ODMA, an on-demand memory allocation framework for RACM. ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard.
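The predictor-plus-buckets policy described above can be sketched in a few lines: route each request to the smallest bucket that fits its predicted output length, with a large safeguard bucket catching heavy-tailed requests. Everything here is hypothetical for illustration; the bucket sizes and the trivial heuristic predictor are stand-ins, not ODMA's learned and dynamically repartitioned components.

```python
# Assumed token-capacity buckets, smallest first, plus a safeguard bucket.
BUCKETS = [128, 512, 2048]
LARGE_BUCKET = 8192

def predict_length(prompt_tokens):
    """Stand-in for the lightweight length predictor (a pure heuristic here)."""
    return min(2 * prompt_tokens, LARGE_BUCKET)

def pick_bucket(prompt_tokens):
    """Route a request to the smallest bucket that fits its predicted length,
    falling back to the large-bucket safeguard for heavy-tailed requests."""
    predicted = predict_length(prompt_tokens)
    for capacity in BUCKETS:
        if predicted <= capacity:
            return capacity
    return LARGE_BUCKET
```

The safeguard matters because a length predictor under distribution drift will occasionally underestimate badly; reserving one oversized bucket bounds the cost of such misses without inflating every allocation.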
arXiv Detail & Related papers (2025-12-10T08:52:20Z)
- APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. This is primarily due to limited support on GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. We propose APT-LLM, a comprehensive acceleration scheme for arbitrary-precision LLMs.
arXiv Detail & Related papers (2025-08-26T14:48:29Z)
- SpecMemo: Speculative Decoding is in Your Pocket [7.062887337934677]
Speculative decoding inherently relies on extra memory allocations to generate several candidate tokens. We present a device-aware inference engine named SpecMemo that can smartly control memory allocations at a finer granularity. With SpecMemo's memory management, we maintain 96% of overall throughput from speculative decoding on MT-Bench.
arXiv Detail & Related papers (2025-05-16T22:12:29Z)
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32 GB consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO that seamlessly combines eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [8.20523619534105]
PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of the KV cache in virtual memory. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention.
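The virtual-contiguity idea can be loosely illustrated with POSIX-style demand paging: reserve one large, virtually contiguous region up front and let the OS commit physical pages only as slots are actually written. This is a rough CPU-side analogy using Python's mmap module, not vAttention's GPU virtual-memory implementation; the region size and per-token slot layout are assumptions for the example.

```python
import mmap

# Reserve a large, virtually contiguous region for a KV-cache-like buffer.
# Anonymous mappings are demand-paged: physical memory is committed only
# when a page is first written, so the big reservation is nearly free.
KV_RESERVE = 64 * 1024 * 1024   # hypothetical 64 MiB virtual reservation
SLOT_BYTES = 128                # hypothetical per-token KV footprint

kv = mmap.mmap(-1, KV_RESERVE)  # anonymous private mapping (no file backing)

def write_slot(step, payload):
    """Write one token's bytes at a fixed offset; virtual contiguity means
    attention kernels could index the cache with plain pointer arithmetic."""
    assert len(payload) == SLOT_BYTES
    off = step * SLOT_BYTES
    kv[off : off + SLOT_BYTES] = payload
    return off

off = write_slot(3, b"\x01" * SLOT_BYTES)
slot = bytes(kv[off : off + SLOT_BYTES])
```

Because the address range is contiguous, no page-table-like indirection (as in PagedAttention's block tables) is needed at lookup time; fragmentation is handled by the OS paging layer instead.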
arXiv Detail & Related papers (2024-05-07T16:00:32Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.