Fugu-MT Paper Translation (Summary): Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity

Paper summary: Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity

arxiv url: http://arxiv.org/abs/2511.04686v1
Date: Thu, 23 Oct 2025 18:22:00 GMT
Status: Translation complete
Last updated in system: 2025-11-16 06:38:30.975441
Title: Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
Title (reference translation): Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
Authors: Pratik Poudel,
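Abstract summary: The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs). This paper examines the interplay between KV cache management strategies and the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct.
Reference score (independently computed level of interest): 0.0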
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV cache management strategies, the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct, and the often-overlooked integrity of positional encodings. Through empirical analysis using a stateful benchmarking framework, we show that LLM generation quality degrades sharply when the accumulated KV cache approaches or exceeds the model's trained context window (e.g., 8192 tokens for Llama 3), a failure mode distinct from GPU memory exhaustion. Common eviction strategies, even high-retention ones (e.g., 99% via AttentionTop), can worsen performance if they disrupt positional coherence. Because LLMs rely on consistent positional signals (e.g., RoPE), compacting a cache by removing non-contiguous tokens can scramble these signals and lead to degenerative outputs. We further show that simple strategies preserving contiguous context blocks (e.g., keeping an initial "gist") can yield more coherent generations than complex or positionally disruptive ones. We advocate for eviction techniques that respect architectural limits, preserve positional structure, and view "cache health" holistically beyond mere size.
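Abstract (reference translation): The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), but its unbounded growth in stateful multi-turn scenarios poses major challenges. This paper examines the interplay between KV cache management strategies, the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct, and the integrity of positional encodings. Through empirical analysis with a stateful benchmarking framework, we show that LLM generation quality degrades sharply as the accumulated KV cache approaches the model's trained context window (e.g., 8192 tokens for Llama 3). Common eviction strategies (e.g., 99% retention via AttentionTop) can worsen performance when they disrupt positional coherence. Because LLMs rely on consistent positional signals (e.g., RoPE), compacting the cache by removing non-contiguous tokens can scramble these signals and lead to degenerative outputs. We further show that simple strategies preserving contiguous context blocks (e.g., keeping an initial "gist") can yield more coherent generations than complex or positionally disruptive ones. We advocate eviction techniques that respect architectural limits, preserve positional structure, and view cache "health" holistically, beyond mere size.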
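
The abstract's point about positional coherence can be illustrated with a small sketch. This is not the paper's implementation. The function names (`keep_prefix`, `keep_top_scored`, `misaligned_tokens`), the made-up scores, and the simplifying assumption that positions are re-derived from slot indices after the cache is compacted are all hypothetical choices made for illustration. Under those assumptions, the sketch compares a contiguous "gist" block with a score-based selection that leaves gaps.

```python
from typing import List, Sequence


def keep_prefix(num_tokens: int, budget: int) -> List[int]:
    """Keep the first `budget` positions as one contiguous block (an initial "gist")."""
    return list(range(min(budget, num_tokens)))


def keep_top_scored(scores: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring positions, returned in original order.

    Hypothetical stand-in for score-based eviction; not the paper's AttentionTop.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def misaligned_tokens(kept_positions: Sequence[int]) -> int:
    """Count kept tokens whose original position differs from their slot index
    once the cache is compacted into slots 0..len(kept_positions)-1.

    Assumption: after compaction, position is inferred from the slot index.
    """
    return sum(1 for slot, pos in enumerate(kept_positions) if slot != pos)


if __name__ == "__main__":
    # Made-up per-token importance scores for an 8-token cache.
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]

    contiguous = keep_prefix(len(scores), budget=4)
    scattered = keep_top_scored(scores, budget=4)

    print("contiguous:", contiguous, "misaligned:", misaligned_tokens(contiguous))
    print("scattered: ", scattered, "misaligned:", misaligned_tokens(scattered))
```

With these toy scores the contiguous block keeps positions 0 to 3 with no misalignment, while the score-based selection keeps positions 0, 2, 4, and 6, so three of the four survivors land in a slot that no longer matches their original position. Both strategies retain the same number of tokens, which is why cache size alone does not capture the difference.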


Related papers