Fugu-MT Paper Translation (Abstract): KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Paper overview: KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

arxiv url: http://arxiv.org/abs/2509.05165v1
Date: Fri, 05 Sep 2025 14:58:24 GMT
Status: Translation complete
Last updated in system: 2025-09-08 14:27:25.629476
Title: KVCompose: Efficient Structured KV Cache Compression with Composite Tokens
Title (reference translation): KVCompose: Efficient Structured KV Cache Compression Using Composite Tokens
Authors: Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed
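Abstract summary: Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding. We propose a simple yet effective KV cache compression framework based on attention-guided, layer-adaptive composite tokens. The method achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods.
Reference score (attention level computed independently by this site): 7.922206020386125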
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, assigning more capacity to layers with informative tokens. This approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Crucially, our approach remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.
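Abstract (reference translation): Large language models (LLMs) depend on key-value (KV) caches for efficient autoregressive decoding, but the cache size grows linearly with context length and model depth and becomes a major bottleneck in long-context inference. Earlier KV cache compression methods either impose rigid heuristics, break tensor layouts through per-attention-head variability, or require specialized compute kernels. We propose a simple yet effective KV cache compression framework based on attention-guided, layer-adaptive composite tokens. The method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, giving more capacity to layers with informative tokens. The approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Importantly, it remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.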
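
Illustrative sketch (not from the paper): the abstract names four steps, namely aggregating attention scores, selecting tokens per head, aligning them into composite tokens, and allocating budgets across layers. The pure-Python code below is one possible reading of those steps, written only to make the idea concrete. Every function name, the use of mean attention as the importance score, the rank-based alignment, and the pooled top-k budget rule are assumptions made here; the abstract does not specify any of them.

```python
# Hypothetical sketch of composite-token KV compression.
# All names and scoring rules are assumptions, not the authors' implementation.

def aggregate_importance(attn):
    """attn[h][q][t] is the attention weight from query q to cached token t
    in head h. Returns scores[h][t]: the mean attention each token receives
    (assumed aggregation rule)."""
    scores = []
    for head in attn:
        n_queries = len(head)
        n_tokens = len(head[0])
        scores.append(
            [sum(row[t] for row in head) / n_queries for t in range(n_tokens)]
        )
    return scores


def select_per_head(scores, budget):
    """For each head independently, return the indices of its `budget`
    highest-scoring tokens, best first (ties go to the earlier token)."""
    return [
        sorted(range(len(s)), key=lambda t: -s[t])[:budget] for s in scores
    ]


def build_composite_tokens(keys, values, selected):
    """keys[h][t] and values[h][t] are per-head vectors. Composite slot i
    holds, for every head, the key/value of that head's i-th ranked token.
    Each head keeps the same number of slots, so the compressed cache stays
    a uniform [heads, budget, dim] structure."""
    comp_k = [[keys[h][t] for t in sel] for h, sel in enumerate(selected)]
    comp_v = [[values[h][t] for t in sel] for h, sel in enumerate(selected)]
    return comp_k, comp_v


def allocate_layer_budgets(layer_slot_scores, total_budget):
    """layer_slot_scores[l][i] is an importance score for composite slot i
    of layer l. Keeps the `total_budget` highest-scoring slots across all
    layers and returns how many slots each layer retains (assumed rule)."""
    pooled = [
        (score, layer)
        for layer, slot_scores in enumerate(layer_slot_scores)
        for score in slot_scores
    ]
    pooled.sort(key=lambda pair: -pair[0])
    budgets = [0] * len(layer_slot_scores)
    for _, layer in pooled[:total_budget]:
        budgets[layer] += 1
    return budgets


if __name__ == "__main__":
    # Toy input: 2 heads, 2 queries, 4 cached tokens.
    attn = [
        [[0.1, 0.6, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1]],
        [[0.5, 0.1, 0.1, 0.3], [0.4, 0.1, 0.1, 0.4]],
    ]
    # Toy K/V vectors that simply encode (head, token) for readability.
    keys = [[[h, t] for t in range(4)] for h in range(2)]
    values = [[[h, t] for t in range(4)] for h in range(2)]

    selected = select_per_head(aggregate_importance(attn), budget=2)
    comp_k, comp_v = build_composite_tokens(keys, values, selected)
    print(selected)
    print(comp_k)
    print(allocate_layer_budgets([[0.9, 0.8, 0.1], [0.5, 0.2, 0.05]], 3))
```

In this toy run the two heads keep different source tokens, yet both end up with exactly two slots, which is the property the abstract describes as respecting the uniform cache structure required by existing inference engines.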

Related papers