Fugu-MT Paper Translation (Abstract): AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

Paper abstract: AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

arxiv url: http://arxiv.org/abs/2604.18137v1
Date: Mon, 20 Apr 2026 12:04:51 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:52.846414
Title: AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
Title (reference translation): AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
Authors: Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki
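Abstract summary: AQPIM is a PIM-aware activation quantization framework based on Product Quantization (PQ). It enables direct computation on compressed data, significantly reducing both the memory footprint and the computational overhead of attention computation.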
Reference score (independently calculated attention score): 1.5897138572815364
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM's high internal bandwidth and enables direct computation on compressed data, significantly reducing both memory footprint and computational overhead for attention computation. AQPIM addresses PQ's accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing the GPU-CPU communication that can account for 90$\sim$98.5\% of decoding latency, together with a 3.4$\times$ speedup over a SOTA PIM approach.
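Abstract (reference translation): Processing-in-Memory (PIM) architectures offer a promising solution to memory bottlenecks in data-intensive machine learning, but they often overlook the growing challenge of the activation memory footprint. Conventional PIM approaches struggle with the massive KV caches that Transformer-based models generate in long-context scenarios. Existing PIM approaches and quantization methods are often insufficient or ill-suited to exploiting the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to improve bandwidth and compute efficiency. We examine clustering-based vector quantization methods. We introduce AQPIM, a new PIM-aware activation quantization framework based on Product Quantization (PQ), and optimize it for Large Language Models (LLMs). By quantizing directly in memory, AQPIM exploits PIM's high internal bandwidth and enables direct computation on compressed data, greatly reducing the memory footprint and the computational overhead of attention computation. AQPIM addresses PQ's accuracy problems by introducing several algorithmic optimizations. AQPIM achieves significant performance improvements, drastically reducing the GPU-CPU communication that can account for 90$\sim$98.5\% of decoding latency, together with a 3.4$\times$ speedup over a SOTA PIM approach.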
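
Illustrative sketch (not from the paper): the abstract says AQPIM builds on Product Quantization (PQ) and computes directly on compressed data, but it does not describe the algorithm itself. The pure-Python example below shows only textbook PQ applied to a toy set of key vectors, under the assumption that attention scores are plain dot products. Each key is split into sub-vectors, each sub-vector is replaced by the index of its nearest k-means centroid, and query-key scores are then obtained from small lookup tables without decompressing the keys. All function names and parameters (`kmeans`, `pq_train`, `pq_encode`, `pq_scores`, the sizes used) are hypothetical and chosen for illustration; AQPIM's in-memory execution and its accuracy optimizations are not modeled.

```python
import random


def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float lists."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def pq_train(vectors, num_subspaces, k):
    """Learn one codebook of k centroids per subspace."""
    d = len(vectors[0]) // num_subspaces
    return [kmeans([v[s * d:(s + 1) * d] for v in vectors], k, seed=s)
            for s in range(num_subspaces)]


def pq_encode(v, codebooks):
    """Compress v to one centroid index per subspace."""
    d = len(v) // len(codebooks)
    return [nearest(v[s * d:(s + 1) * d], cb) for s, cb in enumerate(codebooks)]


def pq_decode(code, codebooks):
    """Reconstruct the approximate vector by concatenating centroids."""
    return [x for s, idx in enumerate(code) for x in codebooks[s][idx]]


def pq_scores(query, codes, codebooks):
    """Approximate query-key dot products computed on the codes alone."""
    d = len(query) // len(codebooks)
    # One table per subspace: dot(query sub-vector, centroid) for every centroid.
    tables = [[sum(q * c for q, c in zip(query[s * d:(s + 1) * d], cent))
               for cent in cb]
              for s, cb in enumerate(codebooks)]
    # Each key costs one table lookup per subspace instead of a full dot product.
    return [sum(tables[s][idx] for s, idx in enumerate(code)) for code in codes]


if __name__ == "__main__":
    rng = random.Random(42)
    keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]  # toy "K cache"
    codebooks = pq_train(keys, num_subspaces=4, k=8)
    codes = [pq_encode(key, codebooks) for key in keys]
    query = [rng.gauss(0, 1) for _ in range(8)]
    approx = pq_scores(query, codes, codebooks)
    exact = [sum(q * x for q, x in zip(query, key)) for key in keys]
    print("first approximate scores:", approx[:3])
    print("first exact scores:      ", exact[:3])
```

In this toy setting each 8-dimensional key is stored as four small indices (3 bits each with 8 centroids) plus shared codebooks. The lookup-table score equals the dot product of the query with the reconstructed key, so the approximation error comes entirely from the quantization step.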

Related papers