Fugu-MT paper translation (abstract): ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

Paper abstract: ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

arxiv url: http://arxiv.org/abs/2603.27914v2
Date: Tue, 31 Mar 2026 03:02:45 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:02.372625
Title: ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
Title (reference translation): ITQ3_S: High-Fidelity 3-bit LLM Inference Using Interleaved Ternary Quantization with Rotation-Domain Smoothing
Authors: Edward J. Yoon
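Abstract summary: We propose ITQ3_S (Interleaved Ternary Quantization -- Specialized), a new 3-bit weight quantization format for LLMs that integrates TurboQuant (TQ). Conventional 3-bit methods lose precision because of heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution.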
Reference score (independently computed attention metric): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for LLMs integrating TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods suffer precision loss from heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure fusing a 256-point Inverse FWHT into the CUDA shared-memory loading stage, ensuring reconstruction error is bounded exclusively by the ternary quantization grid with no additional error from the transform inversion. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, strictly smaller than uniform 3-bit baselines that do not exploit rotation-induced distribution normalization. TurboQuant lacks a native CUDA kernel, precluding direct deployment; naively composing TQ with existing weight quantizers introduces domain mismatch errors that accumulate across layers, degrading quality below standard 3-bit baselines. ITQ3_S resolves this by co-designing the FWHT rotation and quantization kernel as a unified pipeline grounded in the IQ3_S weight format, with the inverse transform fused into the CUDA MMQ kernel. Empirically, on the NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 while delivering throughput exceeding 1.5x that of 4-bit alternatives via optimized DP4A and Tensor Core scheduling. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer hardware.
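Abstract (reference translation): We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a new 3-bit weight quantization format for LLMs that integrates TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods lose precision because of heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution suited to uniform ternary coding. We derive a rigorous dequantization procedure that fuses a 256-point inverse FWHT into the CUDA shared-memory loading stage, guaranteeing that reconstruction error is bounded only by the ternary quantization grid, with no additional error from inverting the transform. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$. TurboQuant has no native CUDA kernel, which prevents direct deployment; naively composing TQ with existing weight quantizers introduces domain-mismatch errors that accumulate across layers and degrade quality below standard 3-bit baselines. ITQ3_S resolves this by co-designing the FWHT rotation and the quantization kernel as a unified pipeline based on the IQ3_S weight format, with the inverse transform fused into the CUDA MMQ kernel. Empirically, on the NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 while delivering more than 1.5x the throughput of 4-bit alternatives through optimized DP4A and Tensor Core scheduling. These results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer hardware.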
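The rotate, quantize, inverse-rotate pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's CUDA kernel or the IQ3_S block layout. The helper names (`fwht`, `quantize_ternary`, `dequantize`, `l2`) and the simple mean-absolute-value ternary coder are assumptions made for illustration only. The sketch shows one property the abstract relies on: because the normalized Walsh-Hadamard transform is orthonormal (and its own inverse), the L2 reconstruction error in the weight domain equals the quantization error in the rotated domain, so inverting the transform adds no error beyond floating-point rounding.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform self-inverse
    return [x * scale for x in out]


def quantize_ternary(vec):
    """Hypothetical ternary coder (an assumption, not the ITQ3_S format):
    each value becomes a code in {-1, 0, +1} with one shared scale."""
    scale = sum(abs(x) for x in vec) / len(vec)
    codes = [0 if abs(x) < 0.5 * scale else (1 if x > 0 else -1) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map ternary codes back to real values in the rotated domain."""
    return [c * scale for c in codes]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weight vector
w[17] = 1.0  # inject a single outlier

rotated = fwht(w)                        # pre-rotate before quantization
codes, scale = quantize_ternary(rotated)
rotated_hat = dequantize(codes, scale)
w_hat = fwht(rotated_hat)                # inverse transform (same function)

err_rot_domain = l2(rotated_hat, rotated)
err_weight_domain = l2(w_hat, w)
print(err_rot_domain, err_weight_domain)
```

The two printed errors should agree up to rounding, which is the isometry argument behind the claim that the bound depends only on the quantization grid. Comparing against `quantize_ternary(w)` applied without rotation is a natural next experiment, though this sketch makes no claim about the outcome for real model weights.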

Related papers