Fugu-MT Paper Translation (Abstract): KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing

Paper overview: KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing

arxiv url: http://arxiv.org/abs/2512.03608v1
Date: Wed, 03 Dec 2025 09:41:03 GMT
Status: Translation complete
Last updated in system: 2025-12-04 20:02:55.232058
Title: KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
Title (reference translation): KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
Authors: Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan
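Abstract summary: Large language models (LLMs) on edge devices enable personalized agents with strong privacy and low cost. With tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating the weight-related linear computations of the decode phase with flash. The paper proposes KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and the KV cache entirely in compute-enabled 3D NAND flash.
Reference score (independently computed attention measure): 6.806071092599333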
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload KV cache to flash suffer from severe performance penalties. We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98\(\times\)/1.94\(\times\)/2.05\(\times\) geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
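The abstract's claim that the KV cache can exceed the model weights as context grows can be checked with simple arithmetic. The sketch below is illustrative only: the layer counts, head counts, head dimension, and FP16 storage are assumptions chosen to resemble common open 7B (MHA) and 70B (GQA) models, not values taken from the paper, and `ModelShape`, `weight_bytes`, and `kv_cache_bytes` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; the values used below are assumptions."""
    params: int    # total parameter count
    layers: int    # number of decoder layers
    kv_heads: int  # key/value heads (equals the query head count for MHA)
    head_dim: int  # dimension of each head


def weight_bytes(m: ModelShape, bytes_per_value: int = 2) -> int:
    """Size of the model weights, assuming FP16 (2 bytes per value) by default."""
    return m.params * bytes_per_value


def kv_cache_bytes(m: ModelShape, context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence of context_len tokens."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * m.layers * m.kv_heads * m.head_dim * context_len * bytes_per_value


# Assumed shapes, not taken from the paper.
MHA_7B = ModelShape(params=7_000_000_000, layers=32, kv_heads=32, head_dim=128)
GQA_70B = ModelShape(params=70_000_000_000, layers=80, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, m in [("MHA 7B", MHA_7B), ("GQA 70B", GQA_70B)]:
        for ctx in (128, 1_000, 10_000, 100_000):
            ratio = kv_cache_bytes(m, ctx) / weight_bytes(m)
            print(f"{name} ctx={ctx:>7}: KV cache / weights = {ratio:.3f}")
```

Under these assumptions the MHA 7B cache costs 512 KiB per token and overtakes the FP16 weights at roughly 27K tokens of context, while grouped-query attention keeps the 70B cache well below the weight size even at 100K tokens.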
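Abstract (reference translation): Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, which puts severe weight-loading and bandwidth pressure on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating the weight-related linear computations of the decode phase with flash, but they still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed the model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload the KV cache to flash suffer from severe performance penalties. The paper proposes KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and the KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by using IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with the flash organization. In addition, it proposes a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98\(\times\)/1.94\(\times\)/2.05\(\times\) geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs, and that it addresses out-of-memory failures at 100K context length.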
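The abstract's point about extremely low arithmetic intensity in single-batch decoding can be illustrated with a back-of-the-envelope model of one weight matrix applied to one token. This is a minimal sketch, not the paper's analysis: the 4096-wide layer, FP16 storage, and the assumption that weights are read once per decode step are all illustrative, and `gemv_arithmetic_intensity` is a hypothetical helper name.

```python
def gemv_arithmetic_intensity(rows: int, cols: int, batch: int = 1,
                              bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for Y = W @ X, with W of shape (rows, cols).

    Assumes W is read once per decode step and shared by all `batch`
    sequences, while each sequence has its own input and output vector.
    """
    # One multiply and one add per weight, per sequence in the batch.
    flops = 2 * rows * cols * batch
    # Weight traffic does not grow with the batch size.
    weight_traffic = rows * cols * bytes_per_value
    # Input vector (cols values) plus output vector (rows values) per sequence.
    activation_traffic = (rows + cols) * batch * bytes_per_value
    return flops / (weight_traffic + activation_traffic)


if __name__ == "__main__":
    for batch in (1, 8, 64):
        ai = gemv_arithmetic_intensity(4096, 4096, batch=batch)
        print(f"batch={batch:>2}: {ai:.2f} FLOPs/byte")
```

With a batch of one, the intensity stays just under 1 FLOP per byte for FP16 weights, so decode speed is set by how fast weights can be streamed rather than by compute. Larger batches amortize the weight traffic, an option that single-user edge inference typically does not have.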


Related papers