Fugu-MT Paper Translation (Abstract): Scalable Training of Mixture-of-Experts Models with Megatron Core

Paper summary: Scalable Training of Mixture-of-Experts Models with Megatron Core

arxiv url: http://arxiv.org/abs/2603.07685v2
Date: Tue, 10 Mar 2026 06:23:58 GMT
Status: Translation complete
Last updated in system: 2026-03-11 12:59:13.034201
Title: Scalable Training of Mixture-of-Experts Models with Megatron Core
Title (reference translation): Scalable Training of Mixture-of-Experts Models Using Megatron Core
Authors: Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, Robin Zhang, Yuzhong Wang, Shifang Xu, Jack Chang, Xuwen Chen, Kunlun Li, Yan Bai, Gao Deng, Nan Zheng, Vijay Anand Korthikanti, Abhinav Khattar, Ethan He, Soham Govande, Sangkug Lym, Zhongbo Zhu, Qi Zhang, Haochen Yuan, Xiaowei Ren, Deyu Fu, Tailai Ma, Shunkang Zhang, Jiang Shao, Ray Wang, Vasudevan Rengasamy, Rachit Garg, Santosh Bhavani, Xipeng Li, Chandler Zhou, David Wu, Yingcan Wei, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, June Yang,
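Abstract summary: Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation. The authors address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading), communication, and computation.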
Reference score (independently computed attention score): 26.9162079065285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
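Abstract (reference translation): Scaling Mixture-of-Experts (MoE) training introduces systems challenges that dense models do not have. Because each token activates only a subset of experts, this sparsity lets total parameters grow much faster than per-token computation, which creates coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, so co-design across the full system stack is required. The authors address these challenges for MoE training with integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry to train MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. The report explains how these techniques work, their trade-offs, and their interactions at the systems level, and provides practical guidance for scaling MoE models with Megatron Core.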
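
The abstract's point that sparsity lets total parameters grow much faster than per-token computation can be illustrated with a small parameter count. The following is a minimal sketch, not taken from the paper or from Megatron Core: the function name `moe_ffn_params`, the layer sizes, and the simplified expert shape (a bias-free two-matrix MLP plus a linear router) are all assumptions chosen for illustration.

```python
def moe_ffn_params(hidden, ffn_hidden, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE FFN layer.

    Assumptions (hypothetical, for illustration only):
      - each expert is a bias-free MLP: hidden -> ffn_hidden -> hidden
      - the router is a single linear map of shape hidden x num_experts
      - each token is routed to exactly top_k experts
    """
    per_expert = 2 * hidden * ffn_hidden   # up-projection + down-projection
    router = hidden * num_experts          # router weights, used by every token
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes: top_k stays fixed while the expert count grows.
    for num_experts in (8, 64, 256):
        total, active = moe_ffn_params(
            hidden=4096, ffn_hidden=2048, num_experts=num_experts, top_k=8
        )
        print(
            f"experts={num_experts:4d}  total={total:,}  "
            f"active/token={active:,}  ratio={total / active:.1f}x"
        )
```

With `top_k` held fixed, the total count scales with the number of experts, while the per-token active count changes only through the small router term. Under these assumptions, that is why memory and communication pressure rise even when per-token compute stays nearly flat.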

Related papers