Fugu-MT Paper Translation (Summary): Adaptive KV-Cache Compression without Manually Setting Budget

Paper summary: Adaptive KV-Cache Compression without Manually Setting Budget

arxiv url: http://arxiv.org/abs/2509.03136v1
Date: Wed, 03 Sep 2025 08:38:40 GMT
Status: Translation complete
Last updated in system: 2025-09-04 21:40:46.465105
Title: Adaptive KV-Cache Compression without Manually Setting Budget
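Title (reference translation): Adaptive KV-Cache Compression without Manual Budget Setting
Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
Abstract summary: Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding. Current KV-cache compression methods suffer from a Procrustes' bed problem. We propose GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification.
Reference score (independently computed attention score): 30.469232780086532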
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To this end, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of the keys required by future queries. The method predicts future query attention demands by Monte-Carlo-style sampling of potential queries, then aggregates the keys those queries select to determine the cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER, and LongBench. Compared to baselines, GVote achieves a 2$\times$ memory reduction while maintaining higher or comparable accuracy.
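Abstract (reference translation): Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, creating significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, which leads to suboptimal resource allocation and inference performance. To this end, we propose GVote, an adaptive KV-cache compression scheme. GVote is based on the principle that the important keys are the aggregate of the keys required by future queries. The method predicts future query attention demands by sampling potential queries in a Monte-Carlo style and aggregating the selected keys, determining the cache budget without manual specification. Experimental evaluation shows GVote's effectiveness on multiple benchmarks, including GSM8K, RULER, and LongBench. Compared with baselines, GVote achieves a 2$\times$ memory reduction with higher or comparable accuracy.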
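
The idea stated in the abstract (sample potential future queries, let each one select the keys it needs, and keep the union) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how queries are sampled or how each query selects keys. The per-dimension Gaussian fit, the attention-mass threshold `mass`, and every function name below are assumptions made only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_keys(query, keys, mass):
    """Return the smallest set of key indices covering `mass` of the attention.

    Assumption: a key is "required" by a query if it falls inside the
    top cumulative-attention set; the abstract does not define this rule.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    order = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    chosen, covered = set(), 0.0
    for i in order:
        chosen.add(i)
        covered += weights[i]
        if covered >= mass:
            break
    return chosen


def sample_queries(observed, n_samples, rng):
    """Draw synthetic queries from a per-dimension Gaussian fit (assumption)."""
    dim = len(observed[0])
    n = len(observed)
    means = [sum(q[d] for q in observed) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((q[d] - means[d]) ** 2 for q in observed) / n)
        for d in range(dim)
    ]
    return [
        [rng.gauss(means[d], stds[d]) for d in range(dim)]
        for _ in range(n_samples)
    ]


def vote_keep_set(observed_queries, keys, n_samples=8, mass=0.95, seed=0):
    """Union of the keys selected by each sampled query.

    The size of the returned list plays the role of the cache budget:
    it falls out of the data instead of being set by hand.
    """
    rng = random.Random(seed)
    keep = set()
    for query in sample_queries(observed_queries, n_samples, rng):
        keep |= select_keys(query, keys, mass)
    return sorted(keep)
```

In this sketch the number of retained keys depends on the input: when attention concentrates on a few keys the kept set is small, and when attention is diffuse it grows. That is the adaptive behaviour the abstract contrasts with a fixed compression ratio.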

Related papers