Fugu-MT Paper Translation (Abstract): Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

Paper summary: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

arxiv url: http://arxiv.org/abs/2509.10798v1
Date: Sat, 13 Sep 2025 03:34:12 GMT
Status: Translation complete
Last updated in system: 2025-09-16 17:26:22.782102
Title: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
Authors: Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
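Abstract summary: Large language models (LLMs) use a key-value (KV) cache to store historical information during sequence processing. Current KV cache eviction methods typically use the last window of the pre-filling phase as queries to compute the KV importance scores for eviction. The paper proposes Judge Q, a novel training method that incorporates a soft token list.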
Reference score (independently computed attention metric): 53.83828564664595
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) use a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with sequence length, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically use the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to focus too heavily on local information, potentially neglecting crucial global information. To mitigate this issue, we propose Judge Q, a novel training method that incorporates a soft token list. This method tunes only the model's embedding layer, at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map over the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache entries are evicted. Under the same eviction budget, our method exhibits less performance degradation than existing eviction approaches. We validate our approach through experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on LongBench and over 3 points on RULER. The proposed method can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
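The query-based scoring that the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `importance_scores` and `keep_positions`, the single attention head, the sum aggregation over queries, and the toy vectors are all assumptions made for illustration. The "observer" queries passed in could be the last-window queries of the baseline or, under Judge Q, the queries of the appended soft tokens; training those soft tokens is not shown.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Sum each key's attention weight over all observer queries.

    Assumption: one attention head, scaled dot-product attention,
    and no causal mask between the observer queries and the keys.
    """
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight
    return scores


def keep_positions(scores, budget):
    """Return positions of the `budget` highest-scoring keys, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data (hypothetical): four cached keys and two observer queries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
observer_queries = [[2.0, 0.0], [1.0, 1.0]]

scores = importance_scores(observer_queries, keys)
kept = keep_positions(scores, budget=2)
print(kept)
```

Only the key and value entries at the returned positions would be retained; the rest would be evicted. Swapping the source of `observer_queries` changes which entries survive without changing the scoring or selection code.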

Related papers