Fugu-MT Paper Translation (Abstract): Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

Paper overview: Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

arxiv url: http://arxiv.org/abs/2511.12286v1
Date: Sat, 15 Nov 2025 16:39:51 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:23.802299
Title: Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
Title (reference translation): Sangam: A Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inference
Authors: Khyati Kiyawat, Zhenxing Fan, Yasas Seneviratne, Morteza Baradaran, Akhil Shekar, Zihan Xia, Mingu Kang, Kevin Skadron
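Abstract summary: Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations. Existing in-memory solutions face limitations such as reduced memory capacity. This work presents a chiplet-based memory module that addresses these limitations.
Reference score (independently computed attention score): 2.9665163298601342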
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high-bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached PIM-chiplet based memory module that can either act as a drop-in replacement for GPUs or co-execute alongside them. Sangam achieves speedups of 3.93x, 4.22x, and 2.82x in end-to-end query latency, 10.3x, 9.5x, and 6.36x greater decoding throughput, and order-of-magnitude energy savings compared to an H100 GPU for varying input size, output length, and batch size on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.
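Abstract (reference translation): Large language models (LLMs) are growing more data-intensive as model sizes increase, and they become memory-bound as the context length, and with it the key-value (KV) cache size, grows. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), which makes it a good fit for processing-in-memory (PIM) approaches. Existing in/near-memory solutions, however, face limitations such as reduced memory capacity caused by the high area cost of integrating processing elements (PEs) into DRAM chips, and restricted PE capability caused by the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by separating logic and memory into chiplets that are fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high-bandwidth access to the DRAM chiplets that house the memory banks, and they integrate advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels. The authors propose Sangam, a CXL-attached, PIM-chiplet based memory module. Compared to an H100 GPU, Sangam achieves 3.93x, 4.22x, and 2.82x speedups in end-to-end query latency, 10.3x, 9.5x, and 6.36x greater decoding throughput, and order-of-magnitude energy savings for varying input size, output length, and batch size on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.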
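
The abstract's claim that decode-phase GEMV and flat GEMM have low operational intensity can be illustrated with a rough calculation. The following is a minimal sketch, not the paper's model: it assumes FP16 operands (2 bytes per element), that each operand and the result cross the memory interface exactly once, and a hypothetical 4096 x 4096 weight matrix. The helper names and the peak compute and bandwidth figures are illustrative placeholders and do not describe any specific device.

```python
def operational_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matrix product.

    Assumes both inputs and the output are moved to or from memory
    exactly once (no reuse losses, no extra caching effects).
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(oi, peak_flops, peak_bandwidth):
    """Roofline test: a kernel is memory-bound below the ridge point."""
    return oi < peak_flops / peak_bandwidth


# Hypothetical layer shape: 4096 x 4096 weights.
gemv = operational_intensity(1, 4096, 4096)       # single token (GEMV)
flat = operational_intensity(16, 4096, 4096)      # small batch (flat GEMM)
square = operational_intensity(4096, 4096, 4096)  # large, square GEMM

# Illustrative machine parameters (assumed, not measured values).
PEAK_FLOPS = 1e15      # FLOP/s
PEAK_BANDWIDTH = 3e12  # bytes/s

for name, oi in [("GEMV", gemv), ("flat GEMM", flat), ("square GEMM", square)]:
    bound = "memory" if is_memory_bound(oi, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    print(f"{name}: {oi:.2f} FLOP/byte, {bound}-bound")
```

Under these assumptions the GEMV case performs roughly one FLOP per byte moved, so its runtime is set almost entirely by memory bandwidth, which is the property that motivates placing compute close to DRAM.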
