Fugu-MT paper translation (abstract): Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Paper overview: Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

arxiv url: http://arxiv.org/abs/2606.20381v1
Date: Thu, 18 Jun 2026 15:40:51 GMT
Status: translation complete
Last updated in system: 2026-06-19 18:23:39.953979
Title: Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Authors: Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
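Abstract summary: The paper proposes UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 achieves lower BF16-relative loss degradation than strong E2M1-based baselines. The results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.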
Reference score (attention score computed independently by this site): 31.895766254664167
License: http://creativecommons.org/licenses/by/4.0/
Abstract: FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
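The geometric argument in the abstract can be illustrated with a small calculation. The sketch below is not the authors' derivation: it assumes round-to-nearest quantization and a locally uniform density inside each bin, and the helper name `per_bin_mean_error` is hypothetical. It uses the standard E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) and, for comparison, an INT4-style uniform grid with levels 0 to 7.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (the sign bit is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# INT4-style uniform magnitude grid, used here as a stand-in for E1M2/INT4.
UNIFORM_GRID = np.arange(8, dtype=np.float64)


def per_bin_mean_error(grid):
    """Mean of Q(x) - x inside each interior round-to-nearest bin.

    Assumption (illustrative only): x is uniformly distributed within each bin,
    so the mean of x equals the bin's midpoint.
    """
    lower_edges = (grid[:-2] + grid[1:-1]) / 2   # midpoint to the left neighbour
    upper_edges = (grid[1:-1] + grid[2:]) / 2    # midpoint to the right neighbour
    bin_centers = (lower_edges + upper_edges) / 2
    # Every x in the bin is mapped to the grid point, so the mean error is
    # the grid point minus the bin's midpoint.
    return grid[1:-1] - bin_centers


e2m1_err = per_bin_mean_error(E2M1_GRID)
uniform_err = per_bin_mean_error(UNIFORM_GRID)

for point, err in zip(E2M1_GRID[1:-1], e2m1_err):
    print(f"E2M1 grid point {point}: mean error {err}")
print("Uniform grid mean errors:", uniform_err)
```

Under these assumptions the E2M1 bins around 2 and 4, where the spacing doubles, have mean errors of -0.125 and -0.25, because each bin extends further above its grid point than below it. Every interior bin of the uniform grid is symmetric, so its mean error is zero. The edge bins at 0 and at the maximum value are left out, since their behaviour depends on scaling and clipping choices the abstract does not specify.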


Related papers