Fugu-MT Paper Translation (Abstract): SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Paper abstract: SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

arxiv url: http://arxiv.org/abs/2512.07993v1
Date: Mon, 08 Dec 2025 19:32:06 GMT
Status: Translation complete
Last updated in system: 2025-12-10 22:28:07.705781
Title: SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
Title (reference translation): SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
Authors: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu
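Abstract summary: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead because the cache grows linearly over the chain-of-thought (CoT) reasoning process. We propose SkipKV, a KV compression method that removes sequences at a coarse-grained sentence level.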
Reference score (independently computed attention measure): 25.509962883211
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit their efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that, due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates at a coarse-grained, sentence-level granularity of sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference, forcing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV reduces generation length by up to 1.6× while improving throughput by up to 1.7×.
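Abstract (reference translation): Large reasoning models (LRMs) often carry significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. The resulting memory and throughput bottlenecks limit efficient deployment. To reduce KV cache size during inference, we first examine how effective existing KV cache eviction methods are for CoT reasoning. Interestingly, because of unstable token-wise scoring and the reduction in effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Furthermore, these methods often generate longer sequences than the original model, because token-wise eviction that ignores semantics causes repeated revalidation during reasoning. To address these problems, we propose SkipKV, a training-free KV compression method for selective eviction and generation that removes sequences at a coarse-grained sentence level for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric that identifies and removes highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, forcing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks show that SkipKV achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. In addition, compared to SoTA, SkipKV reduces generation length by up to 1.6× and improves throughput by up to 1.7×.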
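
The abstract says SkipKV scores sentences in order to identify and remove highly similar ones, but it does not define the metric. The sketch below is only a rough illustration of that idea, not the authors' method: it assumes each reasoning sentence is already represented by a vector, uses cosine similarity as a stand-in score, and drops a sentence when it is too close to one already kept. The names `cosine`, `select_sentences_to_keep`, and the `threshold` value are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_sentences_to_keep(
    sentence_vectors: List[Sequence[float]],
    threshold: float = 0.9,
) -> List[int]:
    """Return indices of sentences to keep.

    Assumption (not from the paper): a sentence counts as redundant when its
    cosine similarity to any previously kept sentence is >= threshold.
    """
    kept: List[int] = []
    for i, vec in enumerate(sentence_vectors):
        is_redundant = any(
            cosine(vec, sentence_vectors[j]) >= threshold for j in kept
        )
        if not is_redundant:
            kept.append(i)
    return kept


# Toy vectors (made up for illustration): sentence 2 nearly repeats sentence 0.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 0.0, 1.0],
]
kept_indices = select_sentences_to_keep(vectors, threshold=0.9)
print(kept_indices)
```

In a real system, the KV entries of every token belonging to a dropped sentence would be evicted together, which is what distinguishes sentence-level removal from the token-wise eviction the abstract criticizes. How SkipKV actually computes its scores and chooses which of two similar sentences to evict is not stated in the abstract.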


Related papers