Fugu-MT Paper Translation (Abstract): SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

Paper overview: SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

arxiv url: http://arxiv.org/abs/2509.24832v1
Date: Mon, 29 Sep 2025 14:16:13 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:20.041784
Title: SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching
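Title (reference translation): SemShareKV: Efficient KV Cache Sharing for Semantically Similar Prompts via Token-Level LSH Matching
Authors: Xinye Zhao, Spyridon Mastorakis
Abstract summary: We propose SemShareKV, a KV cache sharing and compression framework for large language models (LLMs). Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation.
Reference score (attention score computed independently by this site): 0.8307668828380427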
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or reusing shared prefixes or frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing the KV cache across semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
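Abstract (reference translation): As large language models (LLMs) continue to scale up, the memory footprint of the key-value (KV) cache during inference has become a serious bottleneck. Existing approaches focus mainly on compressing the KV cache within a single prompt, or on reusing shared prefixes or frequently recurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which arise frequently in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing the KV cache across semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.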
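
The fuzzy token matching described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that LSH is applied to token embeddings, so the random-hyperplane (sign-bit) LSH variant used here is an assumption, and the names `make_hyperplanes`, `lsh_signature`, and `match_tokens` are hypothetical. The sketch covers only the matching step; RoPE handling and the actual reuse of cached key-value pairs are out of scope.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed LSH family, not from the paper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0 else 0
        for h in hyperplanes
    )


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def match_tokens(target_embs, reference_embs, hyperplanes):
    """For each target token embedding, return the index of the most similar
    reference token that shares its LSH bucket, or None if the bucket is empty."""
    buckets = defaultdict(list)
    for j, emb in enumerate(reference_embs):
        buckets[lsh_signature(emb, hyperplanes)].append(j)

    matches = []
    for emb in target_embs:
        candidates = buckets.get(lsh_signature(emb, hyperplanes), [])
        if not candidates:
            matches.append(None)
            continue
        # Break bucket ties by exact cosine similarity among the candidates only.
        matches.append(max(candidates, key=lambda j: cosine(emb, reference_embs[j])))
    return matches


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, n_bits=8)
    reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    target = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print(match_tokens(target, reference, planes))
```

In a cache-sharing setting, a non-`None` match at target position `i` would mark reference position `j` as a candidate whose cached key-value pair could be reused, while unmatched tokens would be computed from scratch. Bucketing keeps the comparison cost close to linear in the prompt length, at the price of occasionally missing similar tokens that fall on opposite sides of a hyperplane.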

Related papers