Fugu-MT paper translation (abstract): Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

Paper abstract: Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

arxiv url: http://arxiv.org/abs/2606.02823v1
Date: Mon, 01 Jun 2026 19:40:32 GMT
Status: translation complete
Last updated in system: 2026-06-03 22:00:04.569663
Title: Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference
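Title (reference translation): Qift: Shift-friendly no-zero W2 post-training quantization for rotated W2A4/KV4 LLM inference
Authors: Chi-Wei Huang, Chia-Chi Tsai
Abstract summary: Two-bit weight quantization is attractive for memory-efficient LLM inference. The standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. The authors propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference.
Reference score (independently computed attention score): 3.308743964406687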
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Two-bit weight quantization is attractive for memory-efficient LLM inference, but the standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. We study the scalar level-set geometry of two-bit weights in a Hadamard-rotated quantization pipeline. Conventional asymmetric W2 substantially improves over the standard level set, indicating that W2A4 failure is not only a bit-width problem but also a reconstruction-level problem. Across all 224 linear modules in each of LLaMA-2-7B and LLaMA-3.1-8B, pretrained weights are already nearly zero-centered, while Hadamard rotation primarily Gaussianizes their standardized shape: excess kurtosis and Q-Q error drop by orders of magnitude. Based on this approximate zero-centered Gaussian-like source model, we propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference. The main level set is {+/-0.5, +/-1.5}, equivalently {+/-1, +/-3} under a half-scale reparameterization; a power-of-two variant uses {+/-1, +/-4} for sign-and-shift decoded weight application. Qift redesigns the fixed two-bit code-to-level mapping and is training-free, learned-codebook-free, group-grid-free, and zero-point-free, retaining the standard per-channel scale. A scale-invariant ratio analysis identifies an effective inner/outer centroid ratio range of 0.25 to 0.33, explaining why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well while {+/-1, +/-2} does not. On both models, the no-zero level sets consistently improve pure W2A4 perplexity, L-layer mixed W2/W4 perplexity, downstream accuracy, and GPTQ residual behavior over the standard W2 level set. At L=16 mixed precision, they substantially narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision, giving a simple, source-aware, and deployment-friendly alternative to more complex learned W2 codebooks.
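Abstract (reference translation): Two-bit weight quantization is attractive for memory-efficient LLM inference, but the standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. This work studies the scalar level-set geometry of two-bit weights in a Hadamard-rotated quantization pipeline. Conventional asymmetric W2 improves substantially over the standard level set, which indicates that W2A4 failure is a reconstruction-level problem as well as a bit-width problem. Across all 224 linear modules in each of LLaMA-2-7B and LLaMA-3.1-8B, the pretrained weights are already nearly zero-centered, while Hadamard rotation mainly Gaussianizes their standardized shape, reducing excess kurtosis and Q-Q error by orders of magnitude. Based on this approximately zero-centered, Gaussian-like source model, the authors propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference. The main level set is {+/-0.5, +/-1.5}, equivalent to {+/-1, +/-3} under a half-scale reparameterization; a power-of-two variant uses {+/-1, +/-4} so that decoded weights can be applied with sign and shift. Qift redesigns the fixed two-bit code-to-level mapping and is training-free, learned-codebook-free, group-grid-free, and zero-point-free, while retaining the standard per-channel scale. A scale-invariant ratio analysis identifies an effective inner/outer centroid ratio range of 0.25 to 0.33, which explains why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well while {+/-1, +/-2} does not. On both models, the no-zero level sets consistently improve pure W2A4 perplexity, L-layer mixed W2/W4 perplexity, downstream accuracy, and GPTQ residual behavior over the standard W2 level set. At L=16 mixed precision, they substantially narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision, giving a simple, source-aware, and deployment-friendly alternative to more complex learned W2 codebooks.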
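
The level sets named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the max-abs scale rule, and the two-bit code layout (high bit as sign, low bit as magnitude select) are assumptions made here for illustration; only the level values and the 0.25 to 0.33 ratio range come from the abstract.

```python
# Level sets quoted in the abstract. The dictionary keys are hypothetical labels.
LEVEL_SETS = {
    "standard": (-2.0, -1.0, 0.0, 1.0),   # standard W2 set, contains zero
    "mnz": (-1.5, -0.5, 0.5, 1.5),        # main Qift set (mirror no-zero)
    "pot_mnz": (-4.0, -1.0, 1.0, 4.0),    # power-of-two variant
    "pm1_pm2": (-2.0, -1.0, 1.0, 2.0),    # no-zero set reported to work poorly
}


def inner_outer_ratio(levels):
    """Smallest absolute level divided by the largest absolute level."""
    mags = sorted(abs(v) for v in levels)
    return mags[0] / mags[-1]


def quantize_channel(weights, levels):
    """Nearest-level quantization of one channel with a single scale.

    Assumption: the scale maps the largest |weight| onto the largest |level|.
    The paper's actual scale calibration is not described in the abstract.
    """
    max_abs = max(abs(w) for w in weights)
    max_level = max(abs(v) for v in levels)
    scale = max_abs / max_level if max_abs > 0 else 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
        for w in weights
    ]
    return codes, scale


def dequantize_channel(codes, scale, levels):
    """Map two-bit codes back to real-valued weights."""
    return [levels[c] * scale for c in codes]


def pot_apply(code, activation):
    """Apply a {+/-1, +/-4} weight to an integer activation with sign and shift.

    Assumed code layout: bit 1 is the sign, bit 0 selects magnitude 1 or 4.
    """
    sign_bit = (code >> 1) & 1
    mag_bit = code & 1
    shifted = activation << (2 * mag_bit)  # multiply by 1 or by 4
    return -shifted if sign_bit else shifted


if __name__ == "__main__":
    for name in ("mnz", "pot_mnz", "pm1_pm2"):
        print(name, round(inner_outer_ratio(LEVEL_SETS[name]), 3))
    codes, scale = quantize_channel([0.3, -0.9, 0.1, -0.2], LEVEL_SETS["mnz"])
    print(codes, dequantize_channel(codes, scale, LEVEL_SETS["mnz"]))
```

Under these assumptions, the inner/outer ratio is about 0.33 for {+/-0.5, +/-1.5} and 0.25 for {+/-1, +/-4}, both inside the range the abstract reports as effective, while {+/-1, +/-2} gives 0.5 and falls outside it.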


Related papers