Fugu-MT Paper Translation (Abstract): HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

Paper overview: HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

arxiv url: http://arxiv.org/abs/2510.02675v1
Date: Fri, 03 Oct 2025 02:20:17 GMT
Status: Translation complete
Last updated in system: 2025-10-06 16:35:52.240002
Title: HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
Title (reference translation): HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
Authors: Shubham Negi, Kaushik Roy
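Abstract summary: The rapid adoption of large language models (LLMs) has increased the demand for efficient inference in latency-sensitive applications. We propose HALO, a heterogeneous memory-centric accelerator designed for these challenges. HALO achieves up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT.
Reference score (independently computed attention score): 8.057006406834462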
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only on short input context lengths, leaving the low-batch and long-context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory-centric accelerator designed for the unique challenges of the prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes, a fully CiD design and a fully on-chip analog CiM design, to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.
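Abstract (reference translation): The rapid adoption of large language models (LLMs) has increased the demand for efficient inference, especially in latency-sensitive applications such as chatbots and personalized assistants. Unlike conventional deep neural networks, LLM inference proceeds in two distinct phases: a prefill phase that processes the full input sequence in parallel, and a decode phase that generates tokens sequentially. These phases have very different compute and memory requirements, which makes accelerator design particularly difficult. Prior work has mainly been optimized for high-batch inference or evaluated only on short input contexts, leaving the low-batch, long-context regime that is critical for interactive applications largely unexplored. We propose HALO, a heterogeneous memory-centric accelerator designed for the distinct challenges of the prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve hardware utilization, a phase-aware mapping strategy is introduced that adapts to the different demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase run on CiD and benefit from reduced data movement within DRAM. In addition, the performance tradeoffs of LLMs are analyzed under two architectural extremes, a fully CiD design and a fully on-chip analog CiM design, to highlight the need for a heterogeneous design. HALO is evaluated on the LLaMA-2 7B and Qwen3 8B models. The experiments show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.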
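
Illustration (not from the paper): the abstract's contrast between compute-bound prefill and memory-bound decode can be made concrete with a standard roofline-style estimate of arithmetic intensity (FLOPs per byte moved). The sketch below is a minimal example under stated assumptions; the `Gemm` type, the `choose_engine` helper, the 2-byte element size, the matrix shapes, and the `RIDGE_FLOPS_PER_BYTE` threshold are all hypothetical and are not HALO's actual mapping policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gemm:
    """A matrix multiply of an (m x k) activation by a (k x n) weight."""
    m: int  # number of tokens processed at once
    k: int  # input feature dimension
    n: int  # output feature dimension


def arithmetic_intensity(op: Gemm, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte, assuming each operand and the result move once."""
    flops = 2 * op.m * op.k * op.n
    traffic = bytes_per_elem * (op.m * op.k + op.k * op.n + op.m * op.n)
    return flops / traffic


# Hypothetical ridge point: above it an op is treated as compute-bound.
RIDGE_FLOPS_PER_BYTE = 100.0


def choose_engine(op: Gemm) -> str:
    """Send compute-bound ops to CiM and memory-bound ops to CiD."""
    if arithmetic_intensity(op) >= RIDGE_FLOPS_PER_BYTE:
        return "CiM"
    return "CiD"


# Illustrative shapes: a 2048-token prefill versus single-token decode
# at batch size 1, both against a 4096 x 4096 weight matrix.
prefill = Gemm(m=2048, k=4096, n=4096)
decode = Gemm(m=1, k=4096, n=4096)

for name, op in (("prefill", prefill), ("decode", decode)):
    print(name, round(arithmetic_intensity(op), 2), choose_engine(op))
```

With these assumed shapes, the prefill multiply reuses each weight across thousands of tokens, so its intensity is high, while the decode multiply reads the whole weight matrix to produce a single token, so its intensity is close to one FLOP per byte. That gap is the motivation the abstract gives for running the two phases on different engines.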


Related papers