Fugu-MT Paper Translation (Abstract): Pretraining large language models with MXFP4 on Native FP4 Hardware

Paper summary: Pretraining large language models with MXFP4 on Native FP4 Hardware

arxiv url: http://arxiv.org/abs/2605.09825v2
Date: Wed, 13 May 2026 04:29:10 GMT
Status: translation complete
Last updated in system: 2026-05-14 17:13:58.841436
Title: Pretraining large language models with MXFP4 on Native FP4 Hardware
Title (reference translation): Pretraining large language models on FP4-native hardware with MXFP4
Authors: Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir
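Abstract summary: We examine why full-pipeline FP4 training of large language models often diverges even when forward activations and activation gradients remain stable. We find that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity.
Reference score (independently computed attention score): 6.139566055770847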
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
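Abstract (reference translation): Why does full-pipeline FP4 training of large language models diverge even when forward activations and activation gradients are stable? This work addresses the question through a controlled study of MXFP4 quantization in transformer training, enabling FP4 step by step in forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while keeping all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, quantizing Wgrad is the primary cause of convergence degradation, whereas FP4 in Fprop and Dgrad alone adds only a modest number of required tokens. To interpret this behavior, both structured and stochastic interventions are evaluated in a controlled experimental setting. Stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, while deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is caused by structured micro-scaling errors along sensitive gradient paths, not by insufficient stochasticity. The experiments run with native MXFP4 support on AMD Instinct MI355X GPUs, which allows these effects to be investigated in a controlled way without relying on software emulation.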
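
The abstract names two mechanisms without showing them: MXFP4 block quantization and a deterministic Hadamard rotation applied before quantizing. The following is a minimal NumPy sketch of both, written as a software emulation for illustration only (the paper itself uses native hardware). It assumes the MX layout of 32-element blocks that share one power-of-two scale, with FP4 E2M1 elements. The function names (`mxfp4_fake_quant`, `hadamard`, `rotated_fake_quant`), the tie-breaking rule, and the placement of the rotation are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign is a separate bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32      # assumed MX block size: 32 elements share one scale
EMAX_E2M1 = 2   # exponent of the largest E2M1 binade (4.0 == 2**2)


def mxfp4_fake_quant(x):
    """Quantize a 1-D array to MXFP4 and dequantize it again.

    len(x) must be a multiple of 32. Rounding picks the nearest grid
    value, with ties going to the smaller magnitude (a simplification
    chosen for this sketch, not a statement about any hardware).
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    # frexp gives amax = m * 2**e with m in [0.5, 1), so floor(log2(amax)) = e - 1.
    _, e = np.frexp(amax)
    scale = np.ldexp(1.0, e - 1 - EMAX_E2M1)  # shared power-of-two scale
    scaled = blocks / scale
    mag = np.minimum(np.abs(scaled), FP4_E2M1_GRID[-1])  # saturate at 6.0
    idx = np.argmin(np.abs(mag[..., None] - FP4_E2M1_GRID), axis=-1)
    return (np.sign(scaled) * FP4_E2M1_GRID[idx] * scale).reshape(np.shape(x))


def hadamard(n=BLOCK):
    """Orthonormal n x n Hadamard matrix (Sylvester construction)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def rotated_fake_quant(x, h):
    """Rotate each 32-element block by h, quantize, then rotate back."""
    n = h.shape[0]
    rotated = np.asarray(x, dtype=np.float64).reshape(-1, n) @ h
    quantized = mxfp4_fake_quant(rotated.ravel()).reshape(-1, n)
    return (quantized @ h.T).reshape(np.shape(x))


# Usage: compare quantization error with and without the fixed rotation
# on a synthetic vector that contains one large outlier.
rng = np.random.default_rng(0)
g = rng.standard_normal(128)
g[5] = 40.0
err_plain = np.linalg.norm(mxfp4_fake_quant(g) - g)
err_rotated = np.linalg.norm(rotated_fake_quant(g, hadamard()) - g)
print(f"plain: {err_plain:.4f}  rotated: {err_rotated:.4f}")
```

Because the rotation matrix is orthonormal, rotating both operands of a matrix product along the contraction dimension leaves the exact product unchanged, so only the quantization error differs. The sketch makes no claim about which variant gives the lower error for a given input; it only shows where the shared block scale and the fixed rotation enter the computation.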

Related papers