QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design
- URL: http://arxiv.org/abs/2601.14549v1
- Date: Wed, 21 Jan 2026 00:11:34 GMT
- Title: QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design
- Authors: Nilesh Prasad Pandey, Jangseon Park, Onat Gungor, Flavio Ponzina, Tajana Rosing
- Abstract summary: Outlier-aware Quantization with Memory Co-design (QMC) pairs retraining-free quantization with a novel heterogeneous memory architecture. QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x compared to FP16.
- Score: 8.787715061109163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency. SRAM provides fast access but has low density; DRAM must simultaneously accommodate static weights and dynamic KV caches, which creates bandwidth contention; and Flash, although dense, is primarily used for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to LLM inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free quantization method paired with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs, storing inlier weights in compact multi-level Resistive-RAM (ReRAM) while preserving critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation. On language modeling and reasoning benchmarks, QMC matches or outperforms state-of-the-art quantization methods that use advanced algorithms and hybrid data formats, while achieving greater compression under both algorithm-only evaluation and realistic deployment settings. Specifically, against SoTA quantization methods on the latest edge AI platform, QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x relative to FP16, establishing QMC as a scalable, deployment-ready co-design for efficient on-device inference.
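The abstract's core mechanism lends itself to a small illustration. Below is a minimal sketch of outlier-aware weight partitioning: the largest-magnitude weights are kept at full precision (the role the paper assigns to on-chip MRAM) while the rest are uniformly quantized (standing in for multi-level ReRAM cells). The percentile threshold, bit width, and function names are illustrative assumptions, not the paper's actual outlier criterion or device noise model.

```python
import numpy as np

def split_and_quantize(W, outlier_pct=0.5, n_bits=3):
    # Flag the largest-magnitude outlier_pct percent of weights as outliers.
    # NOTE: the threshold rule is an assumption, not QMC's actual criterion.
    thresh = np.percentile(np.abs(W), 100.0 - outlier_pct)
    outliers = np.abs(W) >= thresh

    # Uniform symmetric quantization of the inliers to 2**n_bits levels,
    # standing in for multi-level ReRAM storage.
    inliers = W[~outliers]
    scale = np.abs(inliers).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(inliers / scale),
                -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1)

    W_hat = np.empty_like(W)
    W_hat[~outliers] = q * scale     # low-precision inliers (ReRAM path)
    W_hat[outliers] = W[outliers]    # full-precision outliers (MRAM path)
    return W_hat, outliers

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
W_hat, mask = split_and_quantize(W)
print(f"outliers: {mask.mean():.3%}  MSE: {np.mean((W - W_hat) ** 2):.5f}")
```

Keeping a sub-percent fraction of outliers at full precision typically cuts the reconstruction error sharply relative to quantizing everything, which is the intuition behind dedicating a small high-precision memory to them.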
Related papers
- RAM-Net: Expressive Linear Attention with Selectively Addressable Memory [11.262593123857995]
RAM-Net is a novel architecture designed to bridge the gap between the representational capacity of full attention and the memory efficiency of linear models. The core of RAM-Net maps inputs to high-dimensional sparse vectors serving as explicit addresses, allowing the model to selectively access a massive memory state (a minimal addressing sketch follows this list).
arXiv Detail & Related papers (2026-02-12T13:55:29Z)
- MSN: A Memory-based Sparse Activation Scaling Framework for Large-scale Industrial Recommendation [19.132874291460936]
We propose MSN, a memory-based sparse activation scaling framework for recommendation models. MSN retrieves personalized representations from a large parameterized memory and integrates them into downstream feature interaction modules. MSN consistently improves recommendation performance while maintaining high efficiency.
arXiv Detail & Related papers (2026-02-07T12:43:51Z)
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
- CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving [5.216774377033164]
Large Language Models (LLMs) have revolutionized natural language processing tasks. LLMs face challenges due to the massive memory requirements of key-value (KV) caches. We propose CXL-SpecKV, a novel disaggregated KV-cache architecture.
arXiv Detail & Related papers (2025-12-11T15:40:36Z)
- FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving [2.141726730716452]
FineServe is an inference serving framework for mixed-precision large language models. FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to state-of-the-art GPU sharing systems.
arXiv Detail & Related papers (2025-09-08T00:57:50Z)
- End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization (a zeroth-order update sketch follows this list). Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [58.92253769255316]
LLM inference is challenging due to substantial memory footprint and bandwidth requirements. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck (a rematerialization sketch follows this list). XQuant-CL exploits the cross-layer similarity in the X embeddings for extreme compression.
arXiv Detail & Related papers (2025-08-14T06:52:38Z)
- ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [15.76582272387931]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
arXiv Detail & Related papers (2025-03-13T03:36:03Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration (a self-speculative decoding sketch follows this list).
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators [6.172271429579593]
We propose a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip.
arXiv Detail & Related papers (2025-01-12T11:31:25Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels (a channel-pruning sketch follows this list).
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
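For RAM-Net above, the idea of mapping inputs to high-dimensional sparse address vectors can be illustrated in a few lines. This is a hedged sketch under assumed top-k addressing with outer-product writes; the names (sparse_address, W_addr, M) and the read/write rules are illustrative assumptions, not RAM-Net's actual update equations.

```python
import numpy as np

def sparse_address(x, W_addr, k=8):
    # Project the input and keep only the top-k activations as the address.
    a = W_addr @ x
    mask = np.zeros_like(a)
    mask[np.argsort(a)[-k:]] = 1.0
    return a * mask

rng = np.random.default_rng(0)
d, n_slots = 32, 4096
W_addr = rng.standard_normal((n_slots, d)) / np.sqrt(d)
M = np.zeros((n_slots, d))           # large, selectively addressable memory

x = rng.standard_normal(d)
a = sparse_address(x, W_addr)
M += np.outer(a, x)                  # write touches only k of 4096 slots
recalled = M.T @ a / (a @ a + 1e-8)  # read: content stored at this address
print(np.allclose(recalled, x, atol=1e-5))
```

Because the address is k-sparse, each write and read touches only k rows of the memory, which is the efficiency argument behind selective addressability.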
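For ZeroQAT above, the headline idea of replacing backpropagation with zeroth-order optimization can be sketched with a two-point SPSA-style estimate. The uniform quantizer, the single learnable step size, and the MSE stand-in loss are assumptions for illustration; the paper's actual estimator and objective may differ.

```python
import numpy as np

def quantize(w, step, n_bits=4):
    # Uniform symmetric fake-quantization with a learnable step size.
    lo, hi = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    return np.clip(np.round(w / step), lo, hi) * step

def zo_update(w, step, loss_fn, mu=1e-3, lr=1e-2, rng=np.random.default_rng(0)):
    # Two-point estimate: probe the loss at step +/- mu * u and use the
    # finite difference as a surrogate gradient -- no backprop needed.
    u = rng.standard_normal()
    g = (loss_fn(quantize(w, step + mu * u)) -
         loss_fn(quantize(w, step - mu * u))) / (2 * mu) * u
    return max(step - lr * g, 1e-6)   # keep the step size positive

w = np.random.default_rng(1).standard_normal(1024)
loss = lambda w_q: float(np.mean((w - w_q) ** 2))  # stand-in for task loss
step = 2 * np.abs(w).max() / 2 ** 4
for _ in range(200):
    step = zo_update(w, step, loss)
print(f"tuned step: {step:.4f}  loss: {loss(quantize(w, step)):.5f}")
```

The memory win is that only forward passes are evaluated, so no activations need to be stored for a backward pass.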
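For XQuant above, the trade of memory for compute amounts to caching quantized layer inputs X and rematerializing K and V with extra matrix multiplies instead of storing the KV cache. A minimal sketch, assuming per-tensor int8 for X and ignoring the cross-layer XQuant-CL variant:

```python
import numpy as np

def q8(x):
    # Per-tensor symmetric int8 quantization of a layer input.
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -127, 127).astype(np.int8), s

def deq(xq, s):
    return xq.astype(np.float32) * s

rng = np.random.default_rng(0)
d = 64
W_k = (0.1 * rng.standard_normal((d, d))).astype(np.float32)
W_v = (0.1 * rng.standard_normal((d, d))).astype(np.float32)

x_cache = []                        # int8 inputs instead of fp16 K and V
for t in range(16):                 # decode steps
    x_t = rng.standard_normal(d).astype(np.float32)
    x_cache.append(q8(x_t))         # cache cost: one int8 vector per token
    # Rematerialize K and V on demand; compute replaces memory traffic.
    X = np.stack([deq(xq, s) for xq, s in x_cache])
    K, V = X @ W_k, X @ W_v
print(K.shape, V.shape)
```

Caching one X vector instead of a K and a V vector per token roughly halves cache entries before quantization even starts, at the cost of two extra matmuls per attention call.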
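For QuantSpec above, the self-speculative loop can be sketched as a draft pass with quantized weights followed by a full-precision verify pass that accepts the longest agreeing prefix. The toy one-layer "model", 4-bit weight quantizer, and greedy acceptance rule are illustrative assumptions; QuantSpec additionally quantizes the draft's KV cache hierarchically, which is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 16
E = rng.standard_normal((V, d)).astype(np.float32)        # toy embeddings
W_full = rng.standard_normal((V, d)).astype(np.float32)   # target weights
s = np.abs(W_full).max() / 7.0
W_draft = np.clip(np.round(W_full / s), -8, 7) * s        # same model, 4-bit

def step(h, t):
    return np.tanh(E[t] + 0.5 * h)   # toy recurrent state update

def speculative_step(h0, k=4):
    # Draft phase: propose k tokens cheaply with the quantized weights.
    h, drafts = h0, []
    for _ in range(k):
        t = int(np.argmax(W_draft @ h))
        drafts.append(t)
        h = step(h, t)
    # Verify phase: replay with full-precision weights; accept the longest
    # prefix where both models agree, emitting one correction on mismatch.
    h, accepted = h0, []
    for t in drafts:
        t_full = int(np.argmax(W_full @ h))
        if t_full != t:
            accepted.append(t_full)
            break
        accepted.append(t)
        h = step(h, t)
    return accepted

print(speculative_step(rng.standard_normal(d).astype(np.float32)))
```

Because draft and target share one set of weights (at different precisions), no separate draft model has to fit in edge memory.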
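For ThinK above, query-driven pruning of the key cache can be approximated by scoring each channel with the product of per-channel norms of Q and K and keeping only the top-scoring channels. The scoring rule and keep_ratio are assumptions for illustration; the paper optimizes attention-weight loss directly.

```python
import numpy as np

def prune_key_channels(Q, K, keep_ratio=0.5):
    # Score each channel by how much it can contribute to Q @ K^T,
    # then keep only the highest-scoring channels of the key cache.
    score = np.linalg.norm(Q, axis=0) * np.linalg.norm(K, axis=0)
    keep = np.sort(np.argsort(score)[-int(keep_ratio * K.shape[1]):])
    return K[:, keep], keep

T, d = 128, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, d))    # recent queries drive the pruning
K = rng.standard_normal((T, d))    # key cache to be thinned
K_thin, keep = prune_key_channels(Q, K)
# Attention logits now use only the surviving channels on both sides.
logits = Q[:, keep] @ K_thin.T / np.sqrt(len(keep))
print(K.shape, "->", K_thin.shape)
```

Pruning channels rather than tokens shrinks the cache's hidden dimension, so every cached token gets cheaper instead of some tokens being evicted.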
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.