Fugu-MT paper translation (abstract): Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Paper overview: Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

arxiv url: http://arxiv.org/abs/2603.04428v1
Date: Tue, 17 Feb 2026 05:46:20 GMT
Status: translation complete
Last updated in system: 2026-03-09 01:20:08.206987
Title: Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
Title (reference translation): Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
Authors: Yakov Pyotr Shkolnikov
Abstract summary: Multi-agent LLM systems on edge devices face a memory management problem. On an Apple M4 Pro with a 10.2 GB cache budget, only 3 agents fit at 8K context in FP16. We address this by persisting each agent's KV cache to disk in a 4-bit quantized format.
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model (15.7 seconds per agent at 4K context). We address this by persisting each agent's KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention state across conversation phases without re-computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek-Coder-V2-Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time-to-first-token by up to 136x (Gemma: 22-136x at 4K-32K; DeepSeek: 11-76x at 4K-32K; Llama: 24-111x at 4K-16K; 3-10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows -0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open-source at https://github.com/yshk-mxim/agent-memory
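The capacity figures in the abstract (3 agents in FP16, 4x more in Q4) can be sanity-checked with a rough sizing calculation. The sketch below is not taken from the paper: the layer count (48) is from the abstract, but `kv_heads=8` and `head_dim=256` are assumed values for Gemma 3 12B, "10.2 GB" is read as decimal gigabytes, "8K" is read as 8192 tokens, and every layer is treated as holding the full context. The function names are hypothetical.

```python
# Back-of-the-envelope KV cache sizing.
# ASSUMPTIONS (not from the abstract): kv_heads=8, head_dim=256,
# decimal gigabytes, 8K = 8192 tokens, full-length cache in every layer.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold keys and values for one agent's context."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits_per_element // 8


def agents_that_fit(budget_bytes, per_agent_bytes):
    """Whole agent contexts that fit into a fixed cache budget."""
    return budget_bytes // per_agent_bytes


BUDGET = 10_200_000_000  # 10.2 GB cache budget
CONTEXT = 8 * 1024       # 8K tokens

fp16 = kv_cache_bytes(CONTEXT, layers=48, kv_heads=8, head_dim=256,
                      bits_per_element=16)
# Q4 counted as exactly 4 bits per element; per-group scales and offsets
# are ignored here, so a real Q4 cache is slightly larger than this.
q4 = kv_cache_bytes(CONTEXT, layers=48, kv_heads=8, head_dim=256,
                    bits_per_element=4)

print("FP16 agents:", agents_that_fit(BUDGET, fp16))
print("Q4 agents:  ", agents_that_fit(BUDGET, q4))
```

Under these assumptions the sketch reports 3 agents in FP16 and 12 in Q4, which is consistent with the abstract's figures. Counting quantization metadata would pull the Q4 ratio somewhat below 4x.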
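Abstract (reference translation): Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache at the same time. On an Apple M4 Pro with a 10.2 GB cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow has to evict and reload caches constantly. Without persistence, every eviction forces a full re-prefill that takes 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in a 4-bit quantized format and reloading it directly into the attention layer, so that direct cache restoration removes the redundant O(n) prefill computation. The system consists of three components: a block pool that provides per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention state across conversation phases without re-computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek-Coder-V2-Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time-to-first-token by up to 136x (Gemma: 22-136x at 4K-32K; DeepSeek: 11-76x at 4K-32K; Llama: 24-111x at 4K-16K; 3-10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches changes by -0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open source at https://github.com/yshk-mxim/agent-memory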
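The persist-and-restore idea in the abstract can be illustrated with a toy example: quantize values to 4 bits, write them to disk, then read them back instead of recomputing them. This is a minimal sketch and not the paper's implementation. It uses plain Python with a made-up binary layout rather than safetensors, a flat list of floats rather than real attention tensors, and a hypothetical group size, file name, and set of function names.

```python
import math
import os
import struct
import tempfile

GROUP = 32  # values per quantization group (hypothetical choice)


def quantize_q4(values):
    """Quantize floats to 4-bit codes with one (scale, min) pair per group."""
    assert len(values) % GROUP == 0
    meta, packed = [], bytearray()
    for start in range(0, len(values), GROUP):
        group = values[start:start + GROUP]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 4 bits give 16 levels: 0..15
        meta.append((scale, lo))
        codes = [min(15, max(0, round((v - lo) / scale))) for v in group]
        for a, b in zip(codes[0::2], codes[1::2]):
            packed.append(a | (b << 4))  # two 4-bit codes per byte
    return meta, bytes(packed)


def dequantize_q4(meta, packed):
    """Rebuild approximate floats from 4-bit codes and per-group metadata."""
    values = []
    per_group = GROUP // 2  # bytes per group
    for g, (scale, lo) in enumerate(meta):
        for byte in packed[g * per_group:(g + 1) * per_group]:
            values.append(lo + (byte & 0x0F) * scale)  # low nibble first
            values.append(lo + (byte >> 4) * scale)    # then high nibble
    return values


def save_cache(path, meta, packed):
    """Write group count, float32 (scale, min) pairs, then packed codes."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(meta)))
        for scale, lo in meta:
            f.write(struct.pack("<ff", scale, lo))
        f.write(packed)


def load_cache(path):
    """Read back what save_cache wrote."""
    with open(path, "rb") as f:
        (n_groups,) = struct.unpack("<I", f.read(4))
        meta = [struct.unpack("<ff", f.read(8)) for _ in range(n_groups)]
        packed = f.read()
    return meta, packed


# Stand-in for one agent's cached values (not real K/V data).
kv_slice = [math.sin(i * 0.1) for i in range(64)]
meta, packed = quantize_q4(kv_slice)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "agent_0.q4kv")  # hypothetical file name
    save_cache(path, meta, packed)
    restored = dequantize_q4(*load_cache(path))

max_err = max(abs(a - b) for a, b in zip(kv_slice, restored))
print(f"{len(packed)} packed bytes, max abs error {max_err:.4f}")
```

The point of the sketch is the trade: the restore path costs a file read and a cheap dequantization, while the values come back with a bounded rounding error of about half a quantization step per element.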

Related papers