Fugu-MT Paper Translation (Abstract): HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Paper overview: HiFloat4 Format for Language Model Pre-training on Ascend NPUs

arxiv url: http://arxiv.org/abs/2604.08826v1
Date: Thu, 09 Apr 2026 23:50:56 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.612726
Title: HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Title (reference translation): HiFloat4 Format for Language Model Pre-training on Ascend NPUs (English)
Authors: Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan
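Abstract summary: Recent work has shown that 4-bit floating-point (FP4) formats can be successfully applied to linear GEMM operations in large language models (LLMs). This work investigates the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compares it with MXFP4 in large-scale training settings.
Reference score (independently calculated attention score): 32.1837830814629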
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats--such as MXFP4 and NVFP4--can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
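Abstract (reference translation): Large foundation models have become central to modern machine learning, with performance that scales predictably with model size and data. However, training and deploying such models incurs substantial compute and memory costs, which motivates the development of low-precision training techniques. Recent work has shown that 4-bit floating-point (FP4) formats such as MXFP4 and NVFP4 can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x higher compute throughput and memory efficiency than higher-precision baselines. This work investigates the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compares it with MXFP4 in large-scale training settings. All experiments run on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, in which both standard linear layers and expert-specific GEMMs operate in FP4. We also explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, keeping relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. The results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.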
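
To make the MXFP4 baseline named in the abstract more concrete, here is a minimal sketch of MXFP4-style block quantization. It assumes the publicly documented Microscaling layout (blocks of 32 elements sharing one power-of-two scale, with each element stored as an FP4 E2M1 value). It is not the paper's implementation, it does not model HiFloat4 (whose encoding the abstract does not describe), and the function names `quantize_block` and `quantize` are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2    # exponent of the largest E2M1 magnitude: 6 = 1.5 * 2**2
BLOCK_SIZE = 32  # assumed MXFP4 block size: elements sharing one scale


def quantize_block(block):
    """Quantize one block and return the dequantized floats (fake quantization)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale, chosen so the block maximum falls
    # in the top binade of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        # Saturate magnitudes that exceed the largest E2M1 value.
        mag = min(abs(v) / scale, E2M1_GRID[-1])
        # Nearest grid point; ties resolve to the smaller magnitude here,
        # which is a simplification of hardware rounding modes.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out


def quantize(values, block_size=BLOCK_SIZE):
    """Apply block quantization independently to consecutive blocks of a flat list."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out
```

Because the scale is shared per block, a single large value in a block coarsens the grid for its 31 neighbours; this is one source of the numerical degradation that FP4 stabilization techniques aim to reduce.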

Related papers