Fugu-MT paper translation (abstract): RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

Paper overview: RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

arxiv url: http://arxiv.org/abs/2605.06675v1
Date: Wed, 22 Apr 2026 02:31:58 GMT
Status: Translation complete
Last updated in system: 2026-05-25 12:34:33.668143
Title: RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
Authors: Fei Zuo, Zikang Zhou, Hao Cong, Xiaoyan Xi, Ho Fai Leung
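Abstract summary: A natural idea is to allocate more bits to important heads and fewer to the rest. Applying one quantizer's distortion model to another inverts the allocation order and makes performance worse than uniform quantization. RateQuant fits a per-quantizer distortion model from a small calibration set and solves the resulting bit-allocation problem in closed form.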
Reference score (independently computed attention metric): 3.307797786204237
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve D(b)=alpha*beta^{-b}, and the decay rate beta varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer's distortion model to another inverts the allocation order and makes performance worse than uniform quantization. We call this failure mode distortion model mismatch and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI's perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.
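The two steps named in the abstract, fitting D(b)=alpha*beta^{-b} and allocating bits by reverse waterfilling, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the per-head scale values `alphas`, the single shared `beta`, and the continuous (non-integer) bit-widths are all assumptions made for illustration, since the abstract does not spell out the exact objective.

```python
import numpy as np


def fit_distortion_model(bits, distortion):
    """Least-squares fit of D(b) = alpha * beta**(-b) in log space.

    Hypothetical helper: `bits` and `distortion` are measured
    (bit-width, distortion) pairs from a calibration run.
    """
    slope, intercept = np.polyfit(
        np.asarray(bits, dtype=float), np.log(distortion), 1
    )
    alpha = float(np.exp(intercept))
    beta = float(np.exp(-slope))
    return alpha, beta


def allocate_bits(alphas, beta, avg_bits):
    """Continuous reverse-waterfilling bit allocation (illustrative).

    Minimizes sum_i alphas[i] * beta**(-b_i)
    subject to mean(b_i) == avg_bits and b_i >= 0.
    Assumes avg_bits > 0 and all alphas > 0.
    """
    log_a = np.log(np.asarray(alphas, dtype=float)) / np.log(beta)
    budget = avg_bits * log_a.size
    active = np.ones(log_a.size, dtype=bool)
    while True:
        # Water level theta (in log-beta units): every active head
        # ends up with the same distortion theta.
        log_theta = (log_a[active].sum() - budget) / active.sum()
        bits = np.where(active, log_a - log_theta, 0.0)
        negative = bits < 0
        if not negative.any():
            return bits
        # Heads below the water level get zero bits; their share of
        # the budget is redistributed among the remaining heads.
        active &= ~negative


# Example with made-up per-head scales (assumed values, not paper data).
alpha, beta = fit_distortion_model([2, 3, 4], [2.0 * 4.0 ** -b for b in (2, 3, 4)])
print(allocate_bits([1.0, 4.0, 16.0], beta, avg_bits=2.5))
```

A real quantizer needs integer bit-widths, so the continuous solution would still have to be rounded while preserving the budget; that step is omitted here. Because the allocation depends on `beta`, plugging in a decay rate fitted for a different quantizer changes the result, which is the mismatch the abstract warns about.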

