Fugu-MT Paper Translation (Abstract): Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment

Paper summary: Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment

arxiv url: http://arxiv.org/abs/2606.06527v2
Date: Mon, 08 Jun 2026 03:25:22 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.058434
Title: Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment
Authors: Ovishake Sen, Venkata Nithin Kamineni, Daniel Lobo, Swarup Bhunia, Rickard Ewetz, Baibhab Chatterjee
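Abstract summary: Energy-efficient neural-network inference at the edge requires reducing arithmetic cost, memory traffic, energy, and storage overhead while maintaining acceptable accuracy. This paper presents an ablation-focused study of NVFP4 quantization for edge-efficient neural networks.
Reference score (independently computed attention measure): 9.460818703756205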
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Energy-efficient neural-network inference at the edge requires reducing arithmetic cost, memory traffic, computation energy, and storage overhead while maintaining acceptable accuracy. This paper presents an ablation-focused study of NVFP4 quantization for edge-efficient neural networks, with emphasis on the relationship between activation precision, weight precision, block-size scaling, retraining, and model accuracy. NVFP4 activations are represented using 4-bit FP4 data, an FP8 block scale, and an FP32 tensor scale, enabling ultra-low precision inference while preserving activation dynamic range. A block-size ablation over six edge-efficient models shows that block size B = 16 provides a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. A weight precision ablation further shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path, suggesting that activation quantization and scaling dominate much of the accuracy behavior. To isolate the benefit of the NVFP4 data type, this work compares conventional unscaled FP4 activation inference and NVFP4 activation inference with and without retraining. The results show that conventional FP4 inference collapses accuracy for most compact models, while NVFP4 without retraining already recovers substantial accuracy by restoring activation dynamic range through FP8 block scaling and FP32 tensor scaling. When combined with retraining, NVFP4 achieves the best accuracy across the evaluated models, demonstrating the effectiveness of scaling-aware FP4 (NVFP4) inference. These findings provide general design guidance for hardware-software co-design of low power edge inference across a broad range of accelerator platforms, including GPUs, Tensor Cores, FPGAs, domain-specific AI accelerators, near-memory computing systems, and emerging edge-computing architectures.
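The storage figure quoted in the abstract can be checked with a small sketch. It assumes the per-input cost is the 4-bit FP4 payload plus one 8-bit block scale amortized over B elements and one 32-bit tensor scale amortized over N elements; this accounting reproduces the 4.5078 bits quoted for B = 16 and N = 4096, but it is an inference from the abstract, not a formula taken from the paper. The quantizer below is likewise only illustrative: the function names are hypothetical, the block scale is kept in full precision instead of being rounded to FP8, and the FP32 tensor scale is omitted. The magnitude table is the standard FP4 E2M1 value set.

```python
# Illustrative sketch only; helper names are hypothetical, not from the paper.

# Magnitudes representable by FP4 E2M1 (sign is stored separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def bits_per_input(n, block_size, data_bits=4, block_scale_bits=8,
                   tensor_scale_bits=32):
    """Average storage per element: FP4 payload plus amortized scales."""
    num_blocks = -(-n // block_size)  # ceiling division
    total_bits = (n * data_bits
                  + num_blocks * block_scale_bits
                  + tensor_scale_bits)
    return total_bits / n


def quantize_block(values):
    """Quantize then dequantize one block using a per-block scale.

    Assumption: the scale maps the block's largest magnitude onto the
    largest FP4 magnitude (6.0). A real NVFP4 path would also round
    this scale to FP8 and apply a separate FP32 tensor scale.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / FP4_E2M1_MAGNITUDES[-1]
    out = []
    for v in values:
        # Pick the nearest representable magnitude for |v| / scale.
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def quantize_tensor(values, block_size=16):
    """Apply block-wise quantization over a flat list of activations."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out


if __name__ == "__main__":
    print(bits_per_input(4096, 16))
    print(quantize_tensor([0.1, -0.4, 2.5, 6.0, -3.2], block_size=16))
```

Under this accounting, a smaller block size raises the amortized block-scale cost (8/B bits per input) while giving each scale fewer values to cover, which is the accuracy/storage trade-off the block-size ablation examines.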
