Fugu-MT Paper Translation (Abstract): Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

Paper overview: Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

arxiv url: http://arxiv.org/abs/2511.00321v1
Date: Fri, 31 Oct 2025 23:50:44 GMT
Status: Translation complete
System update date: 2025-11-05 16:37:26.71549
Title: Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
Title (reference translation): Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
Authors: Dowon Kim, MinJae Lee, Janghyeon Kim, HyuckSung Kwon, Hyeonggyu Jeong, Sang-Soo Park, Minyong Yoon, Si-Dong Roh, Yongsuk Kwon, Jinin So, Jungwook Choi,
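Abstract summary: This work proposes scalable Processing-Near-Memory (PNM) for 1M-token LLM inference. The solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts.
Reference score (independently computed attention score): 6.833710057939837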
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU-PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9x throughput improvement, up to 60x lower energy per token, and up to 7.3x better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.
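Abstract (reference translation): Expanding the context windows of large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, but as context lengths increase these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory. This work proposes scalable Processing-Near-Memory (PNM) for 1M-token LLM inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. The design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. It further introduces a hybrid parallelization strategy and a steady-token selection mechanism to improve compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, the solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. The PNM-only offloading scheme (PNM-KV) and the GPU-PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9x higher throughput, up to 60x lower energy per token, and up to 7.3x better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.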
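To give a rough sense of why the KV-cache outgrows GPU memory at this scale, the following is a minimal back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and 16-bit precision are illustrative assumptions for a 405B-class model with grouped-query attention; they are not taken from the abstract, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor of 2: one key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Assumed, illustrative configuration (not stated in the abstract).
NUM_LAYERS = 126          # transformer layers
NUM_KV_HEADS = 8          # KV heads under grouped-query attention
HEAD_DIM = 128            # dimension of each head
CONTEXT_TOKENS = 1 << 20  # "1M" tokens taken as 1,048,576

per_token = kv_cache_bytes(1, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
total = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"KV-cache per token: {per_token / 2**10:.0f} KiB")
print(f"KV-cache for one 1M-token sequence: {total / 2**30:.0f} GiB")
```

Under these assumptions a single 1M-token sequence needs several hundred GiB for its KV-cache alone, before model weights or batching are counted, which is the pressure that motivates keeping the full cache in external CXL memory.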

Related papers