Fugu-MT Paper Translation (Summary): ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

Paper summary: ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

arxiv url: http://arxiv.org/abs/2601.07475v1
Date: Mon, 12 Jan 2026 12:27:22 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:01.378521
Title: ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
Authors: Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma,
Abstract summary: ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. It achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and on downstream tasks.
Reference score (attention measure computed independently by this site): 4.431548809730958
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of fine-grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, it is difficult to adapt existing Post-Training Quantization (PTQ) strategies to these formats: rotation-based methods compromise fine-grained block isolation; smoothing techniques struggle with significant 4-bit quantization errors; and mixed-precision approaches often conflict with hardware constraints on unified-precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst-case error bound of our dual-stage NVFP4 quantization is comparable to that of standard 8-bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to 3x speedup over FP16. Our code is available at https://github.com/actypedef/ARCQuant .
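
The central idea in the abstract, appending quantized residual channels along the reduction dimension so that a single GEMM also performs error compensation, can be sketched in NumPy. This is a minimal simulation under stated assumptions, not the authors' implementation: the names `fake_quant_fp4`, `pick_channels`, and `augment` are hypothetical, the rule for choosing which channels to augment is invented for illustration, and block scales are kept in float64 rather than in the FP8 scale format that real NVFP4 uses.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # one shared scale per 16 consecutive elements


def fake_quant_fp4(x, block=BLOCK):
    """Quantize then dequantize x along its last axis, one scale per block.

    Assumption: the scale stays in float64; the last axis must be a
    multiple of `block`.
    """
    shape = x.shape
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = blocks / scale
    # Nearest grid point per element; ties resolve to the smaller magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(shape)


def pick_channels(x, count=BLOCK):
    """Hypothetical selection rule: the `count` channels with the largest peak."""
    return np.sort(np.argsort(np.abs(x).max(axis=0))[-count:])


def augment(x, w, channels):
    """Build operands whose single matmul includes the residual correction.

    x: (M, K) activations, w: (K, N) weights, channels: indices into K.
    """
    x_q = fake_quant_fp4(x)
    w_q = fake_quant_fp4(w.T).T  # weight blocks run along K
    # Quantize what the first pass lost on the selected channels.
    r_q = fake_quant_fp4(x[:, channels] - x_q[:, channels])
    # Matching weight rows, re-quantized so they form their own 16-wide blocks.
    w_r = fake_quant_fp4(w[channels, :].T).T
    x_aug = np.concatenate([x_q, r_q], axis=1)  # (M, K + len(channels))
    w_aug = np.concatenate([w_q, w_r], axis=0)  # (K + len(channels), N)
    return x_aug, w_aug, x_q, w_q
```

Because the correction term lives in extra columns of the reduction dimension, `x_aug @ w_aug` equals `x_q @ w_q` plus the residual product, and both operands remain in the same simulated 4-bit block format. Since zero is always representable, quantizing the residual never increases the activation error on the selected channels.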

Related papers