Fugu-MT Paper Translation (Abstract): ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

Paper overview: ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

arxiv url: http://arxiv.org/abs/2603.27914v1
Date: Mon, 30 Mar 2026 00:03:22 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.1761
Title: ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
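Title (reference translation): ITQ3_S: High-fidelity 3-bit LLM inference using interleaved ternary quantization with rotation-domain smoothing
Authors: Edward J. Yoon
Abstract summary: We propose ITQ3_S (Interleaved Ternary Quantization -- Specialized), a new 3-bit weight quantization format for large language models (LLMs). The results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.
Reference score (independently calculated attention score): 0.0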
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates TurboQuant (TQ), a rotation-domain adaptive quantization strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit quantization methods suffer from catastrophic precision loss caused by heavy-tailed weight distributions and inter-channel outliers. ITQ3_S addresses this fundamental limitation by pre-rotating the weight space via FWHT prior to quantization, effectively spreading outlier energy across the entire vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. Critically, we derive a mathematically rigorous dequantization procedure that inverts the FWHT exactly using a 256-point Inverse Walsh-Hadamard Transform fused into the CUDA shared-memory loading stage, ensuring zero-error round-trip fidelity between offline quantization and online inference. We prove that for any weight vector $\mathbf{w} \in \mathbb{R}^{256}$ processed by our pipeline, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, where $\epsilon_q$ is determined solely by the ternary quantization grid and is strictly smaller than any uniform 3-bit baseline under equal bit-budget constraints. Empirically, on the NVIDIA RTX 5090 (Blackwell architecture), ITQ3_S achieves perplexity competitive with FP16 baselines while delivering throughput exceeding $1.5\times$ that of 4-bit alternatives, owing to optimized DP4A and Tensor Core scheduling in the interleaved memory layout. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.
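Abstract (reference translation): This paper presents ITQ3_S (Interleaved Ternary Quantization -- Specialized), a new 3-bit weight quantization format for large language models (LLMs) that incorporates TurboQuant (TQ), a rotation-domain adaptive quantization strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit quantization methods suffer catastrophic precision loss caused by heavy-tailed weight distributions and inter-channel outliers. ITQ3_S addresses this fundamental limitation by pre-rotating the weight space with the FWHT before quantization, which spreads outlier energy across the whole vector and induces a near-Gaussian distribution suited to uniform ternary coding. The authors derive a mathematically rigorous dequantization procedure that inverts the FWHT exactly with a 256-point inverse Walsh-Hadamard transform fused into the CUDA shared-memory loading stage, guaranteeing zero-error round-trip fidelity between offline quantization and online inference. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$ processed by the pipeline, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, where $\epsilon_q$ is determined solely by the ternary quantization grid. Empirically, on the NVIDIA RTX 5090 (Blackwell architecture), ITQ3_S achieves perplexity competitive with FP16 baselines and throughput more than $1.5\times$ that of 4-bit alternatives, owing to optimized DP4A and Tensor Core scheduling in the interleaved memory layout. These results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.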
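
The rotate, quantize, inverse-rotate idea in the abstract can be illustrated with a small CPU sketch. This is not the paper's implementation: the interleaved memory layout, the CUDA kernel fusion, and TurboQuant's adaptive strategy are not reproduced, and the mean-absolute-value scale plus all function names (`fwht`, `ternary_quantize`, `dequantize`, `demo`) are assumptions made for illustration. Only the 256-element block length and the use of a Walsh-Hadamard rotation come from the abstract.

```python
import math
import random

BLOCK = 256  # block length, matching the abstract's 256-point transform


def fwht(values):
    """Orthonormal Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def ternary_quantize(values):
    """Map values to codes in {-1, 0, +1} with one shared scale.

    The mean-absolute-value scale is an assumption, not the paper's rule.
    """
    scale = sum(abs(v) for v in values) / len(values)
    if scale == 0.0:
        return [0] * len(values), 0.0
    codes = [max(-1, min(1, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Rebuild rotated-domain values, then undo the rotation."""
    rotated = [c * scale for c in codes]
    return fwht(rotated)  # the orthonormal WHT is its own inverse


def l2(a, b):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def demo(seed=0):
    """Quantize one synthetic block that contains a single large outlier."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.02) for _ in range(BLOCK)]
    w[7] = 1.5  # synthetic outlier, chosen only for this example
    rotated = fwht(w)
    codes, scale = ternary_quantize(rotated)
    w_hat = dequantize(codes, scale)
    rot_err = l2([c * scale for c in codes], rotated)
    return w, rotated, codes, w_hat, rot_err
```

Because the normalized transform is orthogonal, the reconstruction error in the weight domain equals the quantization error in the rotated domain, which is one way to read the abstract's statement that the error bound depends only on the quantization grid. Calling `demo()` and comparing `l2(w_hat, w)` with `rot_err` shows the two values agreeing up to floating-point rounding.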

Related papers