CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference
- URL: http://arxiv.org/abs/2601.19908v1
- Date: Fri, 12 Dec 2025 03:59:36 GMT
- Title: CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference
- Authors: Yanru Chen, Runyang Tian, Yue Pan, Zheyu Li, Weihong Xu, Tajana Rosing
- Abstract summary: We present CHIME, a chiplet-based heterogeneous near-memory accelerator for edge MLLM inference. On FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B), CHIME achieves up to 54x speedup and up to 246x better energy efficiency per inference.
- Score: 19.989162649002274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference is executed under stringent latency and energy constraints, often exacerbated by intermittent connectivity. These challenges become particularly acute for multimodal LLMs (MLLMs), as high-dimensional visual inputs are transformed into extensive token sequences, inflating the key-value (KV) cache and imposing substantial data movement overheads on the LLM backbone. To address these issues, we present CHIME, a chiplet-based heterogeneous near-memory accelerator for edge MLLM inference. CHIME leverages the complementary strengths of integrated monolithic 3D (M3D) DRAM and RRAM chiplets: DRAM supplies low-latency bandwidth for attention, while RRAM offers dense, non-volatile storage for weights. This heterogeneous hardware is orchestrated by a co-designed mapping framework that executes fused kernels near the data, minimizing cross-chiplet traffic to maximize effective bandwidth. On FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B), CHIME achieves up to 54x speedup and up to 246x better energy efficiency per inference compared to the NVIDIA Jetson Orin NX edge GPU. It sustains 116.5-266.5 token/J versus Jetson's 0.7-1.1 token/J. Furthermore, it delivers up to 69.2x higher throughput than the state-of-the-art PIM accelerator FACIL. Compared to an M3D DRAM-only design, CHIME's heterogeneous memory further improves energy efficiency by 7% and performance by 2.4x.
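CHIME's central idea is a placement rule: each decode-phase kernel runs on the chiplet that already holds its operands. The mapping framework itself is not published here, so the sketch below is only a minimal illustration of that rule; the operator names, sizes, and the `place` heuristic are assumptions, and the energy comparison simply restates the token/J figures quoted above.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str         # "attention" (KV-cache-bound) or "weight_gemv" (weight-bound)
    bytes_moved: int  # dominant operand traffic per decode step

def place(op: Op) -> str:
    # Low-latency M3D DRAM chiplets serve the growing KV cache; dense,
    # non-volatile RRAM chiplets hold the static weights they multiply against.
    return "m3d_dram_chiplet" if op.kind == "attention" else "rram_chiplet"

decode_step = [
    Op("attn_qk_softmax_v", "attention", bytes_moved=2 * 4096 * 128),
    Op("ffn_up_gemv", "weight_gemv", bytes_moved=4096 * 11008),
]
for op in decode_step:
    print(f"{op.name} -> {place(op)}")

# Reported efficiency: 116.5-266.5 token/J for CHIME vs 0.7-1.1 token/J on
# the Jetson Orin NX, i.e. roughly two orders of magnitude per token.
print(f"{116.5 / 1.1:.0f}x to {266.5 / 0.7:.0f}x more tokens per joule")
```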
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
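As a rough illustration of the two attention primitives such a hybrid interleaves, here is a NumPy sketch of causal linear attention (a running key-value outer-product state) and causal top-k sparse attention; the kernels and their layer assignment are assumptions, not MiniCPM-SALA's implementation.

```python
import numpy as np

def linear_attention(q, k, v):
    # O(n): carry a running sum of k_t v_t^T outer products instead of
    # materializing the full attention matrix.
    state = np.zeros((k.shape[-1], v.shape[-1]))
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        state += np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

def sparse_topk_attention(q, k, v, topk=4):
    # Causal sparse attention: each query attends to its top-k scoring keys.
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        scores = k[: t + 1] @ q[t]
        idx = np.argsort(scores)[-topk:]
        w = np.exp(scores[idx] - scores[idx].max())
        out[t] = (w / w.sum()) @ v[idx]
    return out

n, d = 8, 16
q, k, v = np.random.default_rng(0).normal(size=(3, n, d))
print(linear_attention(q, k, v).shape, sparse_topk_attention(q, k, v).shape)
```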
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
- VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Diffusion Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
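The parallel decoding referred to here typically works by iterative unmasking: each step scores all masked positions at once and commits the most confident ones. The toy sketch below uses a random stand-in predictor purely to show the control flow; it is not VidLaDA's model.

```python
import numpy as np

def diffusion_decode(seq_len, vocab, steps, rng):
    tokens = np.full(seq_len, -1)                 # -1 marks a masked position
    for _ in range(steps):
        masked = np.flatnonzero(tokens == -1)
        if masked.size == 0:
            break
        # Stand-in predictor: random logits. A real DLM conditions on the
        # whole sequence with bidirectional attention at every step.
        logits = rng.normal(size=(masked.size, vocab))
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        keep = np.argsort(conf)[-max(1, masked.size // 2):]  # commit top half
        tokens[masked[keep]] = pred[keep]
    return tokens

print(diffusion_decode(12, 100, steps=8, rng=np.random.default_rng(0)))
```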
arXiv Detail & Related papers (2026-01-25T15:02:01Z)
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models [49.08289742711585]
We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. We show that InfiniteVL achieves over 3.6x inference speedup while maintaining a constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache.
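A minimal sketch of the two named components, assuming simplified gating: sliding window attention covers local context at O(n·w) cost, while a gated delta-rule recurrence keeps a fixed-size state for unbounded context. The gate and parameter choices below are illustrative, not the paper's.

```python
import numpy as np

def swa(q, k, v, window=4):
    # Each query attends only to the last `window` positions.
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        lo = max(0, t - window + 1)
        s = k[lo:t + 1] @ q[t]
        w = np.exp(s - s.max())
        out[t] = (w / w.sum()) @ v[lo:t + 1]
    return out

def gated_delta_rule(q, k, v, alpha=0.95, beta=0.5):
    # State decays by alpha, then is corrected toward v_t along k_t (delta
    # rule); memory stays O(d*d) no matter how long the stream runs.
    S = np.zeros((k.shape[-1], v.shape[-1]))
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        S = alpha * S + beta * np.outer(k[t], v[t] - k[t] @ S)
        out[t] = q[t] @ S
    return out

n, d = 8, 16
q, k, v = np.random.default_rng(1).normal(size=(3, n, d))
print((swa(q, k, v) + gated_delta_rule(q, k, v)).shape)
```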
arXiv Detail & Related papers (2025-12-09T17:18:32Z)
- Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing [2.9665163298601342]
Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations. Existing in/near-memory solutions face critical limitations such as reduced memory capacity. This work presents a chiplet-based memory module that addresses these limitations.
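A back-of-envelope roofline check makes the memory-bound claim concrete: a decode-phase GEMV performs roughly one multiply-add per weight byte read, far below the ridge point of any modern compute die. The peak compute and bandwidth numbers below are hypothetical placeholders, not from the paper.

```python
def gemv_intensity(rows, cols, bytes_per_weight=2):
    flops = 2 * rows * cols                   # one multiply-add per weight
    traffic = rows * cols * bytes_per_weight  # each weight read exactly once
    return flops / traffic                    # FLOPs per byte of DRAM traffic

ai = gemv_intensity(4096, 11008)              # one fp16 FFN projection
ridge = 200e12 / 100e9                        # hypothetical 200 TFLOPS / 100 GB/s
print(f"GEMV intensity: {ai:.1f} FLOP/B vs ridge point: {ridge:.0f} FLOP/B")
# At ~1 FLOP/B against a ridge in the thousands, the compute die stalls on
# memory; executing the GEMV inside the DRAM module attacks that bottleneck.
```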
arXiv Detail & Related papers (2025-11-15T16:39:51Z)
- Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing [9.984481065465028]
Large Language Models (LLMs) on edge devices are crucial for reducing latency, improving real-time processing, and enhancing privacy. However, implementing LLMs on edge devices presents challenges, particularly in managing key-value (KV) caches. We propose eDRAM as the primary storage for LLM serving on edge devices, as it offers higher density than SRAM.
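To see why density is the binding constraint, a quick KV-cache sizing estimate; the model dimensions are illustrative, not Kelle's configuration.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each seq_len x heads x head_dim, fp16.
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem

mib = kv_cache_bytes(layers=22, heads=16, head_dim=64, seq_len=4096) / 2**20
print(f"KV cache at 4K context: {mib:.0f} MiB")
# Hundreds of MiB dwarfs on-chip SRAM budgets, which is why eDRAM's higher
# density makes an on-chip primary KV store plausible at the edge.
```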
arXiv Detail & Related papers (2025-10-16T07:12:08Z)
- Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving [24.320791041324316]
Stratum is a system-hardware co-design approach that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D DRAM), near-memory processing (NMP), and GPU acceleration. The system achieves up to 8.29x improvement in decoding throughput and 7.66x better energy efficiency across various benchmarks compared to GPU baselines.
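One plausible reading of a tiered MoE design is frequency-based expert placement: hot experts live in the fast memory tier, the long tail in the dense tier. The sketch below illustrates only that policy shape; the tier names and the placement rule are assumptions, not Stratum's algorithm.

```python
from collections import Counter

def place_experts(routing_trace, fast_tier_slots):
    # Hot experts (most frequently routed) get the fast Mono3D DRAM tier;
    # everything else stays in the denser, slower tier.
    hot = {e for e, _ in Counter(routing_trace).most_common(fast_tier_slots)}
    return {e: "fast_mono3d_tier" if e in hot else "dense_tier"
            for e in set(routing_trace)}

trace = [0, 1, 0, 2, 0, 1, 3, 0, 1, 4]  # token-to-expert routing decisions
print(place_experts(trace, fast_tier_slots=2))
```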
arXiv Detail & Related papers (2025-10-06T18:09:47Z)
- Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference [8.475319961845903]
Edge accelerators should achieve high area efficiency and minimize external memory access. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array architecture. Our solution achieves 247/117 tokens/s/mm² while running a 1.3B LLM in long-input/long-output scenarios.
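For readers unfamiliar with the baseline dataflow, here is a minimal software model of an output-stationary systolic matmul, the primitive such arrays build on; the scheduling is collapsed for brevity and says nothing about the paper's specific hybrid design.

```python
import numpy as np

def systolic_matmul(A, B):
    # Output-stationary dataflow: each PE (i, j) keeps C[i, j] in place and
    # accumulates one partial product per cycle. Skewed operand injection is
    # folded into the loop order; one "step" is one wavefront of the array.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for step in range(k):
        C += np.outer(A[:, step], B[step, :])
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
print(systolic_matmul(A, B))
```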
arXiv Detail & Related papers (2025-07-11T20:27:30Z)
- L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference [6.886434948681708]
Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention. We propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices.
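The split such a system implies can be sketched as an interface: the GPU keeps the compute-dense projections, while the DIMM-PIM side evaluates attention against the resident KV cache, so only a query vector goes in and one output vector comes out per head. The function boundaries below are assumptions, not L3's actual API.

```python
import numpy as np

def gpu_side(x, Wq, Wk, Wv):
    return x @ Wq, x @ Wk, x @ Wv            # dense GEMMs: compute-friendly

def pim_side(q, K, V):
    # Runs next to the DIMM holding K/V, so the full cache never crosses
    # the memory bus during decode.
    s = K @ q
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

d, ctx = 16, 128
rng = np.random.default_rng(2)
Wq, Wk, Wv = rng.normal(size=(3, d, d))
K, V = rng.normal(size=(2, ctx, d))          # KV cache resident in DIMM-PIM
q, _, _ = gpu_side(rng.normal(size=d), Wq, Wk, Wv)
print(pim_side(q, K, V).shape)
```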
arXiv Detail & Related papers (2025-04-24T14:14:07Z)
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
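The memory saving has a simple arithmetic explanation: softmax-based matching must retain keys and values for every memorized frame, while a linear-attention state is a fixed d x d matrix regardless of video length. The sizes below are illustrative, not measurements from the paper.

```python
def softmax_bank_mib(frames, tokens_per_frame, d, bytes_per=2):
    # STM-style matching keeps K and V for every memorized frame.
    return frames * tokens_per_frame * d * 2 * bytes_per / 2**20

def linear_state_mib(d, bytes_per=2):
    return d * d * bytes_per / 2**20          # constant, length-independent

d, tpf = 256, 1024
for frames in (100, 1000):
    print(f"{frames} frames: bank {softmax_bank_mib(frames, tpf, d):.0f} MiB "
          f"vs state {linear_state_mib(d):.3f} MiB")
```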
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
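In the spirit of virtual tensor management, here is a paged-KV toy: logical token positions map onto physical pages allocated on demand, decoupling a tensor's logical shape from contiguous memory. The real system drives GPU virtual-memory machinery; the class and field names below are invented for illustration.

```python
import numpy as np

class VirtualKV:
    def __init__(self, page_tokens, d):
        self.page_tokens, self.d = page_tokens, d
        self.pages = []          # physical pages, each (page_tokens, d)
        self.n = 0               # logical length in tokens

    def append(self, vec):
        if self.n % self.page_tokens == 0:    # current page full: map a new one
            self.pages.append(np.empty((self.page_tokens, self.d)))
        self.pages[-1][self.n % self.page_tokens] = vec
        self.n += 1

    def read(self, t):
        return self.pages[t // self.page_tokens][t % self.page_tokens]

kv = VirtualKV(page_tokens=4, d=8)
for t in range(10):
    kv.append(np.full(8, t))
print(len(kv.pages), kv.read(9)[0])   # 3 pages mapped for 10 tokens
```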
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but poses challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves these challenges by introducing the first end-to-end FPGA accelerator for multi-task ViT, with a collection of architectural innovations.
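The task-level sparsity an accelerator can exploit comes from the routing itself: only the top-k experts selected by the gate need their weights resident for a given input. A minimal top-k MoE forward pass, with illustrative gate weights and sizes:

```python
import numpy as np

def moe_forward(x, gate_W, experts, topk=2):
    # Gate scores every expert, but only the top-k are evaluated and blended.
    logits = x @ gate_W
    top = np.argsort(logits)[-topk:]
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(3)
d, n_experts = 16, 8
gate_W = rng.normal(size=(d, n_experts))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
print(moe_forward(rng.normal(size=d), gate_W, experts).shape)
```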
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
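A sketch of the attention flavor the summary describes, assuming a ReLU kernel and two aggregation scales; the exact module wiring in EfficientViT may differ.

```python
import numpy as np

def relu_linear_attention(q, k, v):
    # ReLU feature map makes attention a product of O(n) summaries.
    q, k = np.maximum(q, 0), np.maximum(k, 0)
    kv = k.T @ v                              # d x d summary of all tokens
    z = q @ k.sum(0) + 1e-6                   # per-query normalizer
    return (q @ kv) / z[:, None]

def multi_scale(q, k, v):
    # Second branch attends over 2x mean-pooled tokens, giving each query a
    # coarser, wider receptive field; the branches are averaged.
    k2 = (k[0::2] + k[1::2]) / 2
    v2 = (v[0::2] + v[1::2]) / 2
    return 0.5 * (relu_linear_attention(q, k, v)
                  + relu_linear_attention(q, k2, v2))

n, d = 16, 8
q, k, v = np.random.default_rng(4).normal(size=(3, n, d))
print(multi_scale(q, k, v).shape)
```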
arXiv Detail & Related papers (2022-05-29T20:07:23Z)