Fugu-MT Paper Translation (Summary): Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Paper summary: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

arxiv url: http://arxiv.org/abs/2505.04519v1
Date: Wed, 07 May 2025 15:46:36 GMT
Status: Translation complete
Last updated in system: 2025-05-08 19:07:36.135457
Title: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
Title (reference translation): Pangu Ultra MoE: How to Train a Large MoE on Ascend NPUs
Authors: Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu, Changzheng Zhang, Da Kuang, Fei Liu, Gang Huang, Jiansheng Wei, Jiarui Qin, Jie Ran, Jinpeng Li, Jun Zhao, Liang Dai, Lin Li, Liqun Deng, Peifeng Qin, Pengyuan Zeng, Qiang Gu, Shaohua Tang, Shengjun Cheng, Tao Gao, Tao Yu, Tianshu Li, Tianyu Bi, Wei He, Weikai Mao, Wenyong Huang, Wulong Liu, Xiabing Li, Xianzhi Yu, Xueyu Wu, Xu He, Yangkai Du, Yan Xu, Ye Tian, Yimeng Wu, Yongbing Huang, Yong Tian, Yong Zhu, Yue Li, Yufei Wang, Yuhang Gai, Yujun Li, Yu Luo, Yunsheng Ni, Yusen Sun, Zelin Chen, Zhe Liu, Zhicheng Liu, Zhipeng Tu, Zilin Ding, Zongyuan Zhan
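Abstract summary: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters dominate the realm of the most capable language models. This paper aims to uncover a recipe for harnessing such scale on Ascend NPUs. The key goals are better use of computing resources under dynamic sparse model structures and realizing the expected performance gains on actual hardware.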
Reference score (independently computed attention score): 111.69640966866059
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
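Abstract (reference translation): Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters dominate the realm of the most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. This paper aims to uncover a recipe for harnessing such scale on Ascend NPUs. The key goals are better use of computing resources under dynamic sparse model structures and realizing the expected performance gains on actual hardware. To select model configurations suited to Ascend NPUs without repeatedly running costly experiments, we use simulation to compare the trade-offs of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we ran experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize communication between NPU devices and reduce synchronization overhead. We also optimize memory efficiency within the devices to further reduce parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system can handle all training stages of state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behavior of such models for future reference.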

Related papers