Fugu-MT Paper Translation (Abstract): Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

Paper abstract: Training LLMs with Fault Tolerant HSDP on 100,000 GPUs

arxiv url: http://arxiv.org/abs/2602.00277v1
Date: Fri, 30 Jan 2026 19:57:36 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.090422
Title: Training LLMs with Fault Tolerant HSDP on 100,000 GPUs
Authors: Omkar Salpekar, Rohan Varma, Kenny Yu, Vladimir Ivanov, Yang Wang, Ahmed Sharif, Min Si, Shawn Xu, Feng Tian, Shengbao Zheng, Tristan Rice, Ankush Garg, Shangfu Peng, Shreyas Siravara, Wenyin Fu, Rodrigo de Castro, Adithya Gangidi, Andrey Obraztsov, Sharan Narang, Sergey Edunov, Maxim Naumov, Chunqiang Tang, Mathew Oldham
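Abstract summary: Synchronous training suffers from low efficiency because of frequent failures and long recovery times. We propose FT-HSDP (Fault Tolerant Hybrid-Shared Data Parallelism). FT-HSDP uses data-parallel replicas as the unit of fault tolerance.
Reference score (independently computed attention score): 9.97532556913539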
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training results in low efficiency due to frequent failures and long recovery times. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance. When failures occur, only the single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks such as adding or removing participants dynamically, and relies on the GPU to perform data transfer for best performance. 2) We introduce a non-blocking catch-up protocol, allowing a recovering replica to join training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP can reduce the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP's asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.
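The idea of treating each data-parallel replica as the unit of fault tolerance can be illustrated with a toy sketch. This is not the paper's FTAR protocol, which splits control logic (CPU) from data transfer (GPU) and is not specified in the abstract. The function names `reduce_over_healthy` and `participants_per_step`, the `fail_at` schedule, and the fixed `recovery_steps` window are all hypothetical, chosen only to show gradient averaging over a membership set that shrinks on failure and grows again on rejoin.

```python
from typing import Dict, List, Set


def reduce_over_healthy(grads: Dict[int, List[float]], healthy: Set[int]) -> List[float]:
    """Average gradients over the replicas that are currently healthy.

    Toy stand-in for an all-reduce with dynamic membership: replicas that
    are offline are simply left out of the average (an assumption of this
    sketch, not a description of FTAR).
    """
    participants = [r for r in sorted(grads) if r in healthy]
    if not participants:
        raise RuntimeError("no healthy replica available")
    width = len(grads[participants[0]])
    return [
        sum(grads[r][i] for r in participants) / len(participants)
        for i in range(width)
    ]


def participants_per_step(num_replicas: int, steps: int,
                          fail_at: Dict[int, int], recovery_steps: int) -> List[int]:
    """Count participating replicas at each step.

    fail_at maps replica id -> step at which it fails (hypothetical input).
    A failed replica is offline for `recovery_steps` steps and then rejoins,
    while all other replicas keep training throughout.
    """
    counts = []
    for step in range(steps):
        healthy = {
            r for r in range(num_replicas)
            if not (r in fail_at and fail_at[r] <= step < fail_at[r] + recovery_steps)
        }
        counts.append(len(healthy))
    return counts


if __name__ == "__main__":
    grads = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [8.0, 9.0]}
    print(reduce_over_healthy(grads, {0, 1, 2}))   # [4.0, 5.0]
    print(reduce_over_healthy(grads, {0, 2}))      # [4.5, 5.5], replica 1 offline
    print(participants_per_step(3, 6, {1: 2}, 2))  # [3, 3, 2, 2, 3, 3]
```

In this sketch the surviving replicas never stall: the step count keeps advancing while replica 1 is offline, which is the property the abstract contrasts with fully synchronous training.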


Related papers