Fugu-MT Paper Translation (Abstract): RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

Paper overview: RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

arxiv url: http://arxiv.org/abs/2606.06256v1
Date: Thu, 04 Jun 2026 14:57:07 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.880798
Title: RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
Title (reference translation): RedKnot: Highly Efficient LLM with Head-Aware KV Reuse and SegPagedAttention
Authors: Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu
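Abstract summary: We propose RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads.
Reference score (independently computed attention measure): 20.633983983180812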
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.
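Abstract (reference translation): As the input length of large language models (LLMs) continues to grow, the KV cache has become a major bottleneck in AI infrastructure. It limits GPU memory capacity, concurrency, cache reuse, and distributed scalability. Important problems such as position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management all depend on how the KV cache is represented and managed. However, existing serving systems rely heavily on a monolithic KV cache abstraction, in which the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We propose RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling uniform support for position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.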
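
The abstract's central idea, that a full KV cache is not needed for every head, can be illustrated with a toy sketch. The code below is not RedKnot's implementation: the abstract does not say how head importance or effective attention ranges are determined, so the per-head sliding windows, the class names (HeadKVCache, HeadAwareKVCache), and the window sizes are all assumptions chosen only to show how decomposing the cache along KV heads can reduce the number of stored entries compared with a monolithic cache that keeps every token for every head.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HeadKVCache:
    """KV entries for a single KV head (hypothetical structure)."""
    # None keeps every token; an integer N keeps only the most recent N tokens.
    window: Optional[int]
    keys: List[object] = field(default_factory=list)
    values: List[object] = field(default_factory=list)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if self.window is not None:
            excess = len(self.keys) - self.window
            if excess > 0:
                # Drop the oldest entries that fall outside this head's window.
                del self.keys[:excess]
                del self.values[:excess]


class HeadAwareKVCache:
    """One independently managed cache per KV head (illustrative only)."""

    def __init__(self, windows):
        self.heads = [HeadKVCache(window=w) for w in windows]

    def append_token(self, per_head_kv):
        # per_head_kv holds one (key, value) pair for each head.
        for head, (k, v) in zip(self.heads, per_head_kv):
            head.append(k, v)

    def stored_entries(self):
        return sum(len(h.keys) for h in self.heads)


def demo(num_tokens=1000):
    # Assumed windows: one long-range head, two mid-range heads, one local head.
    windows = [None, 128, 128, 16]
    cache = HeadAwareKVCache(windows)
    for t in range(num_tokens):
        # Integers stand in for real key/value tensors.
        cache.append_token([(t, t)] * len(windows))
    monolithic_entries = num_tokens * len(windows)
    return cache.stored_entries(), monolithic_entries
```

With these assumed windows, demo(1000) stores 1000 + 128 + 128 + 16 = 1272 entries, against 4000 for a cache that retains every token on all four heads. A real system would also have to preserve output fidelity, which this sketch does not attempt to model.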

Related papers