Fugu-MT Paper Translation (Abstract): IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Paper abstract: IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

arxiv url: http://arxiv.org/abs/2603.28430v1
Date: Mon, 30 Mar 2026 13:37:45 GMT
Status: translation complete
Last updated in system: 2026-03-31 23:18:45.422585
Title: IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Title (reference translation): IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Authors: Zhongping Ji
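Abstract summary: We propose a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant, with peak speedups above $6\times$.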
Reference score (independently computed attention level): 0.4496256885343706
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
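Abstract (reference translation): Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, but the resulting $3$D partition is poorly aligned with modern hardware and offers only limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. Each $4$D block is represented as a quaternion and transformed in closed form by $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost. At $d=128$, IsoQuant-Full reduces the forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, and IsoQuant-Fast reduces it further to $512$. Across $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors.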
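
The closed-form transform $T(v)=q_L v \overline{q_R}$ described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the paper's fused CUDA kernel. The function names (`qmul`, `rotate_block`, `rotate_vector`, `fma_count`), the `(w, x, y, z)` component order, the random sampling of unit quaternions, and the choice of the left factor for the Fast variant are all assumptions made for illustration. The FMA tally assumes 16 multiply-adds per quaternion product, which reproduces the abstract's figures of $1{,}024$ and $512$ at $d=128$.

```python
import math
import random


def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qconj(q):
    """Quaternion conjugate; for a unit quaternion this is also its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def random_unit_quaternion(rng):
    """Sample a unit quaternion by normalizing a 4D Gaussian vector (assumed scheme)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def rotate_block(v, q_left, q_right=None, inverse=False):
    """Apply T(v) = q_L v conj(q_R) to one 4D block.

    With q_right=None only one isoclinic factor is applied (Fast-style variant;
    using the left factor is an assumption). inverse=True applies
    conj(q_L) v q_R, which undoes the forward transform.
    """
    if inverse:
        out = qmul(qconj(q_left), v)
        return qmul(out, q_right) if q_right is not None else out
    out = qmul(q_left, v)
    return qmul(out, qconj(q_right)) if q_right is not None else out


def rotate_vector(x, lefts, rights=None, inverse=False):
    """Rotate a d-dimensional vector block by block; d must be a multiple of 4."""
    assert len(x) % 4 == 0, "dimension must be divisible by 4"
    out = []
    for b in range(len(x) // 4):
        block = tuple(x[4 * b:4 * b + 4])
        q_right = rights[b] if rights is not None else None
        out.extend(rotate_block(block, lefts[b], q_right, inverse))
    return out


def fma_count(d, full=True):
    """Multiply-add tally, assuming 16 per quaternion product."""
    products_per_block = 2 if full else 1
    return (d // 4) * 16 * products_per_block


if __name__ == "__main__":
    rng = random.Random(0)
    d = 128
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    lefts = [random_unit_quaternion(rng) for _ in range(d // 4)]
    rights = [random_unit_quaternion(rng) for _ in range(d // 4)]
    y = rotate_vector(x, lefts, rights)                    # Full-style rotation
    x_back = rotate_vector(y, lefts, rights, inverse=True)  # undo the rotation
    print(max(abs(a - b) for a, b in zip(x, x_back)) < 1e-9)
    print(fma_count(d, full=True), fma_count(d, full=False))
```

Because multiplication by a unit quaternion preserves the Euclidean norm, each block transform is orthogonal, and the inverse only needs conjugates rather than a stored matrix. A 4D block also maps onto a single 4-wide vector load, which is one plausible reading of the "hardware-aligned" claim.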
