Fugu-MT Paper Translation (Abstract): DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

Paper abstract: DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

arxiv url: http://arxiv.org/abs/2604.17789v2
Date: Tue, 21 Apr 2026 07:37:48 GMT
Status: Translation complete
Last updated in system: 2026-04-22 14:04:47.929266
Title: DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
Title (reference translation): DuQuant++: Fine-grained rotation that enables microscaling FP4 quantization
Authors: Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
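Abstract summary: We propose DuQuant++, which adapts DuQuant to the MXFP4 format. Experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance.
Reference score (independently computed attention score): 47.19478866645546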
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B = 32). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant-v2.
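Abstract (reference translation): The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements that share an E8M0 scaling factor, has emerged as a promising foundation for efficient LLM inference, supported natively by NVIDIA Blackwell Tensor Cores. Under this format, however, activation outliers pose a particular problem: a single outlier inflates the shared block scale, compresses the effective dynamic range of the remaining elements, and causes significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and so cannot specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts DuQuant's outlier-aware fine-grained rotation to the MXFP4 format by aligning the rotation block size with the microscaling group size (B = 32). Because each MXFP4 group has its own scaling factor, the cross-block variance problem that required dual rotations and a zigzag permutation in the original DuQuant no longer applies, so DuQuant++ can replace the entire pipeline with a single outlier-aware rotation, halving the online rotation cost while also smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. The code is available at https://github.com/Hsu1023/DuQuant-v2.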
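
The outlier effect described in the abstract can be illustrated with a small sketch. It is not taken from the DuQuant++ code. It assumes the usual MXFP4 convention of FP4 E2M1 elements with a shared power-of-two scale chosen as 2^(floor(log2(max|x|)) - 2), and it simplifies rounding (ties go to the smaller magnitude). The names `quantize_block`, `hadamard`, and `rotate_blocks` are hypothetical. The normalized Hadamard matrix is only a data-agnostic stand-in for a per-group rotation; the outlier-aware rotation of DuQuant++ is constructed differently and is not reproduced here.

```python
import math

# Magnitudes representable by the FP4 E2M1 element format.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # exponent of the largest E2M1 binade (4.0 = 2**2)
BLOCK = 32      # MXFP4 group size; also the rotation block size (B = 32)

def quantize_block(block):
    """Fake-quantize one group: shared power-of-two scale, E2M1 elements.

    Assumption: the E8M0 exponent range is not clamped and ties round
    toward the smaller magnitude, which is simpler than real hardware.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])   # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out

def hadamard(n):
    """Orthogonal n x n Hadamard matrix (Sylvester form, n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate_blocks(x, rot):
    """Apply one B x B rotation to each consecutive B-element group of x.

    len(x) is assumed to be a multiple of B, so a rotation never mixes
    elements that belong to different scaling groups.
    """
    b = len(rot)
    out = []
    for start in range(0, len(x), b):
        g = x[start:start + b]
        out.extend(sum(g[i] * rot[i][j] for i in range(b)) for j in range(b))
    return out

clean = [0.3] * BLOCK
spiked = [0.3] * (BLOCK - 1) + [100.0]    # one activation outlier

print(quantize_block(clean)[0])    # 0.25: a fine scale keeps small values
print(quantize_block(spiked)[0])   # 0.0: the outlier inflates the scale

rotated = rotate_blocks(spiked, hadamard(BLOCK))
print(max(abs(v) for v in rotated))  # peak magnitude after the rotation
```

Because the rotation is orthogonal, applying the same per-group rotation to activations and weights leaves their inner product unchanged, which is what allows a rotation to be inserted ahead of quantization. Whether a given rotation actually lowers the FP4 error depends on the data and on the rotation itself, so this sketch makes no claim about accuracy.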

