L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
- URL: http://arxiv.org/abs/2504.17584v1
- Date: Thu, 24 Apr 2025 14:14:07 GMT
- Title: L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
- Authors: Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen
- Abstract summary: Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention. We propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices.
- Score: 6.886434948681708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data-swapping overhead. We identify that the critical memory bottleneck lies exclusively in the decoding phase of multi-head attention (MHA), which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight is that this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offer scalability in both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: first, hardware redesigns resolve data-layout and computational-element mismatches in DIMM-PIM, improving hardware utilization during LLM inference; second, communication optimization hides data-transfer overhead behind computation; third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1$\times$ speedup over state-of-the-art HBM-PIM solutions while supporting significantly larger batch sizes.
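The capacity-and-bandwidth argument is easy to quantify. Below is a back-of-envelope sketch (our own illustration, not code from the paper; the head, layer, and precision parameters are assumed values) showing that decode-phase MHA streams the entire KV cache once per generated token at roughly one FLOP per byte, so its bandwidth demand and working set both grow linearly with context length, exactly the profile that DIMM-PIM scales well on.

```python
# Back-of-envelope model of decode-phase MHA (illustrative, not from the
# paper): every new token must stream the whole KV cache once, so bytes
# moved grow linearly with context length while FLOPs per byte stay ~1.

def decode_mha_cost(seq_len, n_heads=32, head_dim=128, n_layers=32, dtype_bytes=2):
    """Per-token KV-cache traffic (bytes) and FLOPs for one decode step."""
    # KV cache holds keys *and* values for every past token, head, and layer.
    kv_bytes = 2 * seq_len * n_heads * head_dim * n_layers * dtype_bytes
    # Each step computes QK^T scores and an attention-weighted sum of V:
    # ~2 dot products of length head_dim (2 FLOPs each) per cached token.
    flops = 2 * seq_len * n_heads * (2 * head_dim) * n_layers
    return kv_bytes, flops

for seq_len in (8_192, 131_072):
    b, f = decode_mha_cost(seq_len)
    print(f"{seq_len:>7} tokens: KV cache {b / 2**30:5.1f} GiB/step, "
          f"{f / b:.1f} FLOPs/byte")
```

At one FLOP per byte, a GPU with TFLOP/s-scale compute sits idle almost the entire decode step; capacity and bandwidth, not compute, set the limit.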
Related papers
- PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [13.678531084541666]
We propose PAPI, a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units.
PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous accelerator and a state-of-the-art PIM-only accelerator, respectively.
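As a rough illustration of the scheduling idea, here is a minimal roofline-style dispatcher (our sketch with hypothetical names and threshold, not PAPI's actual interface): kernels whose arithmetic intensity falls below a ridge point are routed to PIM, the rest to the compute-centric accelerator.

```python
# Hypothetical roofline-style dispatcher in the spirit of PAPI's dynamic
# scheduling; the Kernel fields and ridge point are illustrative.
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # total off-chip memory traffic

def dispatch(kernel: Kernel, ridge_flops_per_byte: float = 10.0) -> str:
    """Route memory-bound kernels to PIM, compute-bound kernels to the GPU."""
    intensity = kernel.flops / kernel.bytes_moved
    return "PIM" if intensity < ridge_flops_per_byte else "GPU"

decode_attn = Kernel("decode_mha", flops=6.9e10, bytes_moved=6.9e10)
prefill_gemm = Kernel("prefill_gemm", flops=1.0e12, bytes_moved=2.0e9)
print(dispatch(decode_attn))   # PIM (~1 FLOP/byte: memory-bound)
print(dispatch(prefill_gemm))  # GPU (~500 FLOPs/byte: compute-bound)
```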
arXiv Detail & Related papers (2025-02-21T13:52:31Z)
- LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System [6.21613161960432]
Large language models (LLMs) process sequences of tens of thousands of tokens.
Processing-in-memory (PIM) maximizes memory bandwidth by moving compute to the data.
LoL-PIM is a multi-node PIM architecture that accelerates long-context LLM decoding through hardware-software co-design.
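One way to read "multi-node" here: the KV cache can be partitioned across PIM modules, with each module computing attention over its local slice and the partials merged exactly. The sketch below (our construction, not LoL-PIM's actual dataflow) shows the standard log-sum-exp merge that makes such context partitioning numerically exact.

```python
# Context-partitioned decode attention (illustrative, not LoL-PIM code):
# each "node" holds a KV-cache slice, computes partial attention locally,
# and the host merges partials via the log-sum-exp trick.
import numpy as np

def partial_attn(q, k_slice, v_slice):
    s = k_slice @ q                    # scores for this slice
    m = s.max()                        # slice-local max for stability
    p = np.exp(s - m)
    return m, p.sum(), p @ v_slice     # (max, normalizer, weighted values)

def merged_decode_attn(q, kv_slices):
    ms, sums, outs = zip(*(partial_attn(q, k, v) for k, v in kv_slices))
    m = max(ms)                        # global max across slices
    scale = [np.exp(mi - m) for mi in ms]
    denom = sum(si * ci for si, ci in zip(sums, scale))
    return sum(oi * ci for oi, ci in zip(outs, scale)) / denom

rng = np.random.default_rng(0)
d, L = 64, 4096
q = rng.normal(size=d)
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
slices = [(K[i:i + 1024], V[i:i + 1024]) for i in range(0, L, 1024)]
s = K @ q
w = np.exp(s - s.max())
assert np.allclose(merged_decode_attn(q, slices), (w / w.sum()) @ V)
```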
arXiv Detail & Related papers (2024-12-28T14:38:16Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
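For readers unfamiliar with the baseline being improved on, here is standard (single-axis) gradient checkpointing in PyTorch; a minimal sketch of the memory/compute trade-off, not Video-Ma$^2$mba's multi-axis scheme.

```python
# Standard gradient checkpointing (the baseline the abstract compares to):
# activations inside each checkpointed layer are recomputed during backward
# instead of stored, trading extra FLOPs for a smaller memory footprint.
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    for _ in range(24))

def forward(x, use_ckpt=True):
    for layer in layers:
        # With checkpointing, only the layer inputs are kept for backward.
        x = checkpoint(layer, x, use_reentrant=False) if use_ckpt else layer(x)
    return x

x = torch.randn(64, 512, requires_grad=True)
forward(x).sum().backward()  # recomputes activations layer by layer
```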
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU.
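"Linear matching via linear attention" refers to the associativity trick: with a positive feature map phi, attention can be computed as phi(Q) @ (phi(K).T @ V), which is linear rather than quadratic in sequence length. A minimal non-causal sketch (ours, not LiVOS code):

```python
# Linear attention via a positive feature map (illustrative sketch):
# phi(Q) @ (phi(K).T @ V) costs O(L * d * d_v) instead of O(L^2 * d).
import numpy as np

def phi(x):
    # ELU(x) + 1: a common positive feature map for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (L, d); v: (L, d_v). State is a (d, d_v) summary, not (L, L)."""
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                  # (d, d_v): all keys/values compressed
    z = qf @ kf.sum(axis=0)        # (L,): per-query normalizers
    return (qf @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
L, d = 1024, 64
out = linear_attention(*(rng.normal(size=(L, d)) for _ in range(3)))
print(out.shape)  # (1024, 64)
```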
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN).
It shows that batch sizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes < 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
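The core of LUT quantization is that each low-bit weight code indexes a small per-group table of floating-point values, so dequantization is a gather rather than arithmetic. A minimal sketch of that layout (our illustration; FLUTE's actual kernel fuses this into the GEMM on-chip):

```python
# Lookup-table dequantization (illustrative, not the FLUTE kernel): 4-bit
# codes index a 16-entry table shared by each group of 128 input channels.
import numpy as np

def lut_dequant_matmul(codes, tables, x, group_size=128):
    """codes: (out, in) uint8 in [0, 16); tables: (in // group_size, 16)."""
    out_dim, in_dim = codes.shape
    w = np.empty((out_dim, in_dim), dtype=np.float32)
    for g in range(in_dim // group_size):
        cols = slice(g * group_size, (g + 1) * group_size)
        w[:, cols] = tables[g][codes[:, cols]]  # gather: code -> fp value
    return w @ x

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(8, 256), dtype=np.uint8)
tables = rng.normal(size=(256 // 128, 16)).astype(np.float32)
x = rng.normal(size=256).astype(np.float32)
print(lut_dequant_matmul(codes, tables, x))  # shape (8,)
```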
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference [2.9302211589186244]
Large language models (LLMs) have transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations.
Developments in computing and memory capabilities are lagging behind, exacerbated by the slowdown of Moore's law.
Compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by performing analog computations directly in memory.
arXiv Detail & Related papers (2024-06-12T16:57:58Z)
- PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance.
Processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
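As a toy picture of what distributed optimization on PIM means, the sketch below (ours, not PIM-Opt's system; deterministic shards for brevity) runs gradient descent with data sharded across emulated PIM units: each unit computes a gradient against its local shard, and only small gradient vectors cross the memory channel.

```python
# Toy emulation of data-parallel gradient descent across PIM units
# (illustrative, not PIM-Opt code): gradients are computed next to each
# local data shard and only the small per-shard gradients are averaged.
import numpy as np

def gd_step_pim(w, shards, lr=0.1):
    """One step of least-squares gradient descent over sharded data."""
    grads = [2 * X.T @ (X @ w - y) / len(y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w_true = rng.normal(size=8)
shards = []
for _ in range(4):  # four PIM units, each owning a private shard
    X = rng.normal(size=(256, 8))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=256)))

w = np.zeros(8)
for _ in range(200):
    w = gd_step_pim(w, shards)
print(np.abs(w - w_true).max())  # small: converged near the true weights
```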
arXiv Detail & Related papers (2024-04-10T17:00:04Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- PIM-DRAM: Accelerating Machine Learning Workloads using Processing in Memory based on DRAM Technology [2.6168147530506958]
We propose a processing-in-memory (PIM) multiplication primitive to accelerate matrix-vector operations in ML workloads.
We show that the proposed architecture, mapping, and data flow can provide up to 23x and 6.5x benefits over a GPU.
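Conceptually, such a primitive partitions a matrix-vector product across DRAM banks so multiply-accumulates happen next to the data, and only small partial results cross the memory channel. A host-side emulation of that partitioning (our sketch, not the paper's hardware):

```python
# Bank-parallel GEMV emulation (illustrative, not PIM-DRAM's circuits):
# each "bank" owns a row slice of A and computes its partial y locally.
import numpy as np

def pim_gemv(matrix, vector, n_banks=16):
    """Emulate y = A @ x with rows of A partitioned across DRAM banks."""
    row_groups = np.array_split(np.arange(matrix.shape[0]), n_banks)
    partials = []
    for rows in row_groups:
        # In real DIMM-PIM the MAC runs inside the bank; only this small
        # partial result would travel over the memory channel.
        partials.append(matrix[rows] @ vector)
    return np.concatenate(partials)

rng = np.random.default_rng(0)
A, x = rng.normal(size=(1024, 512)), rng.normal(size=512)
assert np.allclose(pim_gemv(A, x), A @ x)
```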
arXiv Detail & Related papers (2021-05-08T16:39:24Z)