Fugu-MT Paper Translation (Abstract): DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference

Paper summary: DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference

arxiv url: http://arxiv.org/abs/2602.03184v1
Date: Tue, 03 Feb 2026 06:54:56 GMT
Status: Translation complete
Last updated in system: 2026-02-04 18:37:15.297091
Title: DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference
Title (reference translation): DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference
Authors: Jiancai Ye, Jun Liu, Qingchen Li, Tianlang Zhao, Hanbin Zhang, Jiayi Pan, Ningyi Xu, Guohao Dai
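Abstract summary: The KV cache is essential for efficient large language model (LLM) inference. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. The paper proposes DynSplit-KV, a KVCache compression method that dynamically identifies the delimiters used for splitting. Experiments show that DynSplit-KV achieves a 2.2x speedup compared with FlashAttention and a 2.6x peak memory reduction in long-context scenarios.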
Reference score (independently computed attention metric): 14.476177850166126
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although the Key-Value (KV) cache is essential for efficient large language model (LLM) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers from significant accuracy degradation (ranging from 5.5% to 55.1%) across different scenarios, because semantic boundaries are scenario-dependent. This highlights the need for dynamic semantic splitting that follows those boundaries. Achieving this raises two challenges: (1) improper delimiter selection misaligns semantics with the KVCache, resulting in a 28.6% accuracy loss; (2) variable-length blocks after splitting introduce over 73.1% additional inference overhead. To address these challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. It comprises (1) a dynamic importance-aware delimiter selection strategy, improving accuracy by 49.9%, and (2) a uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, a 2.2x speedup compared with FlashAttention, and a 2.6x peak memory reduction in long-context scenarios.
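Abstract (reference translation): The Key-Value (KV) cache is indispensable for efficient large language model (LLM) inference, but its increasing memory footprint in long-context scenarios becomes a major bottleneck, which makes KVCache compression essential. Current compression methods depend on rigid splitting strategies such as fixed intervals or pre-defined delimiters. Because semantic boundaries depend on the scenario, rigid splitting suffers from marked accuracy degradation (5.5% to 55.1%) across different scenarios. This underlines the need for dynamic semantic splitting that matches the semantics. Two challenges stand in the way. (1) Improper delimiter selection misaligns the semantics with the KVCache, losing 28.6% accuracy. (2) The variable-length blocks produced by splitting add more than 73.1% inference overhead. To address these challenges, the paper proposes DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. It consists of (1) a dynamic importance-aware delimiter selection strategy, which improves accuracy by 49.9%, and (2) a uniform mapping strategy that converts variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves a 2.2x speedup compared with FlashAttention and a 2.6x peak memory reduction in long-context scenarios.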
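As a rough illustration of the two ideas named in the abstract (splitting at delimiters, then mapping variable-length blocks to a fixed-length format), the following minimal Python sketch may help. It is not the paper's implementation: the abstract does not describe how delimiters are scored or how the uniform mapping works, so the fixed delimiter set, the function names `split_at_delimiters` and `to_fixed_length`, the `<pad>` token, and the pad-and-mask scheme are all assumptions made purely for illustration.

```python
from typing import List, Set, Tuple


def split_at_delimiters(tokens: List[str], delimiters: Set[str]) -> List[List[str]]:
    """Split tokens into variable-length blocks, closing a block at each delimiter.

    Assumption: the delimiter set is given up front. DynSplit-KV is described as
    selecting delimiters dynamically by importance, which is not modeled here.
    """
    blocks: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        current.append(tok)
        if tok in delimiters:
            blocks.append(current)
            current = []
    if current:  # trailing tokens without a closing delimiter
        blocks.append(current)
    return blocks


def to_fixed_length(
    blocks: List[List[str]], block_len: int, pad: str = "<pad>"
) -> Tuple[List[List[str]], List[List[bool]]]:
    """Map variable-length blocks to fixed-length rows plus a validity mask.

    Assumption: long blocks are chunked and short ones padded. This is one
    generic way to obtain a uniform layout, not the mapping from the paper.
    """
    rows: List[List[str]] = []
    mask: List[List[bool]] = []
    for block in blocks:
        for start in range(0, len(block), block_len):
            chunk = block[start:start + block_len]
            n_pad = block_len - len(chunk)
            rows.append(chunk + [pad] * n_pad)
            mask.append([True] * len(chunk) + [False] * n_pad)
    return rows, mask


tokens = "a b . c d e f g , h".split()
blocks = split_at_delimiters(tokens, {".", ","})
rows, mask = to_fixed_length(blocks, block_len=4)
```

In this toy run the blocks have lengths 3, 6, and 1, and the mapping yields four rows of length 4, with the mask marking which slots hold real tokens. A uniform layout of this kind is what lets downstream kernels process blocks in regular batches, which is the sort of overhead the abstract attributes to variable-length blocks.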

Related papers