Fugu-MT Paper Translation (Abstract): When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

Paper abstract: When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

arxiv url: http://arxiv.org/abs/2605.05699v1
Date: Thu, 07 May 2026 05:44:39 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.537063
Title: When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
Title (reference translation): When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
Authors: Mohamed Amine Bergach
Abstract summary: KV-cache quantization is framed as a quality-latency trade-off. The paper shows that this trade-off is inverted on Apple Silicon's unified memory.
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $λ$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a HuggingFace \texttt{Cache} subclass, runs \emph{faster than fp16} across $256$--$4096$-token prefixes on Gemma-3 1B ($-3$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7$ to $-2.6\%$ through $1$K), with $3\times$ persistent memory compression and quality preserved ($\dPPL = 0.000$ Qwen short-prompt; $+3.6$ hook $\dPPL$ Gemma). The kernel's $\sim\!25$\,ns/vec overhead is below the bandwidth savings from $3\times$ compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe ($\dPPL = +7975 \to +638.6$, $12.5\times$ reduction) at $182$\,GFLOPS / $D{=}128$. Supporting findings: $\SRFT$ and $\SRHT$ are statistically indistinguishable for KV quality (we pick $\SRFT$ for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning $R+λ$ without SRFT lowers calibration MSE $84.9\%$ vs $50.3\%$ but yields worse PPL); Householder rotations at $k{=}d/2$ reflectors are effectively lossless at $d{=}256$.
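As a rough illustration of the last two stages the abstract names (per-group abs-max scaling and int4 nibble packing), a minimal NumPy sketch follows. It is not the paper's fused Metal kernel. The group size of 32, the float32 scales, the symmetric [-7, 7] range, the low/high nibble order, and the function names `quantize_int4` and `dequantize_int4` are all assumptions made for illustration; the sign-randomized FFT and per-channel $λ$ stages are omitted.

```python
import numpy as np

GROUP = 32  # hypothetical group size; the abstract does not state one


def quantize_int4(x, group=GROUP):
    """Per-group abs-max quantization to signed int4, two values per byte."""
    x = np.asarray(x, dtype=np.float32)
    assert group % 2 == 0 and x.size % group == 0
    g = x.reshape(-1, group)
    # One scale per group, chosen so the largest magnitude maps to +/-7.
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)  # all-zero groups
    q = np.clip(np.rint(g / scale), -7, 7).astype(np.int8)
    # Shift to unsigned [1, 15], then pack pairs:
    # even index goes to the low nibble, odd index to the high nibble (assumed order).
    u = (q + 8).astype(np.uint8)
    packed = (u[:, 0::2] | (u[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale


def dequantize_int4(packed, scale):
    """Unpack nibbles and rescale; returns an array of shape (n_groups, group)."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2] = lo
    q[:, 1::2] = hi
    return q.astype(np.float32) * scale


# Usage: quantize one D=128 vector (the head dimension quoted in the abstract).
rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float32)
packed, scale = quantize_int4(k)
k_hat = dequantize_int4(packed, scale).reshape(-1)
```

With round-to-nearest, each reconstructed element differs from the input by at most half of its group's scale. The packed payload takes half a byte per element plus one scale per group, so the achieved compression depends on the group size and scale precision, neither of which the abstract specifies.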
Abstract (reference translation): KV-cache quantization is framed as a quality--latency trade-off. The paper shows that this trade-off is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $λ$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a HuggingFace \texttt{Cache} subclass, runs \emph{faster than fp16} across $256$--$4096$-token prefixes on Gemma-3 1B ($-3$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7$ to $-2.6\%$ through $1$K), with $3\times$ persistent memory compression and quality preserved ($\dPPL = 0.000$ Qwen short-prompt; $+3.6$ hook $\dPPL$ Gemma). The kernel's $\sim\!25$\,ns/vec overhead is below the bandwidth savings from $3\times$ compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe ($\dPPL = +7975 \to +638.6$, $12.5\times$ reduction) at $182$\,GFLOPS / $D{=}128$. Supporting findings: $\SRFT$ and $\SRHT$ are statistically indistinguishable for KV quality (the authors pick $\SRFT$ for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning $R+λ$ without SRFT lowers calibration MSE $84.9\%$ vs $50.3\%$ but yields worse PPL); Householder rotations at $k{=}d/2$ reflectors are effectively lossless at $d{=}256$.


Related papers