Fugu-MT Paper Translation (Abstract): RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Paper summary: RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

arxiv url: http://arxiv.org/abs/2605.08317v1
Date: Fri, 08 May 2026 15:15:06 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.561328
Title: RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
Authors: Junkai Zhang, Hang Guo, Luca Benini, Yawei Li
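Abstract summary: Large language models (LLMs) show strong performance across diverse tasks, but inference with long input contexts is bottlenecked by memory size and bandwidth. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. This paper casts KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme.
Reference score (independently computed interest level): 28.54642982960947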
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
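The abstract names reverse water-filling as the guide for bit allocation but does not spell out the procedure. The sketch below shows only the textbook form of reverse water-filling for independent Gaussian components under squared-error distortion, not RDKV's actual algorithm. It assumes that the per-token or per-channel weights can be treated like variances, and the function name `reverse_water_filling` and its parameters are hypothetical. Components whose weight falls below the water level receive zero bits (the eviction end-point), while the rest receive a positive rate (the quantization end-point).

```python
import math


def reverse_water_filling(weights, total_bits, iters=100):
    """Textbook reverse water-filling (illustrative, not the RDKV algorithm).

    weights    : positive per-component variances; here a stand-in
                 (assumption) for per-token or per-channel weights.
    total_bits : total rate budget in bits, shared by all components.

    Returns (theta, rates), where theta is the water level and
    rates[i] = max(0, 0.5 * log2(weights[i] / theta)).
    """
    if total_bits < 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need a non-negative budget and positive weights")

    def rates_at(theta):
        # Components at or below the water level get zero bits.
        return [max(0.0, 0.5 * math.log2(w / theta)) for w in weights]

    # Total rate decreases monotonically as theta rises, so bisect on theta.
    hi = max(weights)  # every rate is zero at this level
    # At this level the largest component alone needs total_bits + 1 bits.
    lo = hi * 2.0 ** (-2.0 * total_bits - 2.0)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect in the log domain
        if sum(rates_at(mid)) > total_bits:
            lo = mid  # over budget: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    return hi, rates_at(hi)


if __name__ == "__main__":
    theta, rates = reverse_water_filling([16.0, 4.0, 1.0, 0.25], total_bits=3.0)
    print(theta, rates)
```

With weights 16, 4, 1, 0.25 and a budget of 3 bits, the water level settles near 1, so the two largest components receive about 2 bits and 1 bit and the two smallest receive essentially none. The rates are real-valued; mapping them to hardware-friendly integer bit-widths is a separate step that the abstract does not describe.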
