Fugu-MT Paper Translation (Abstract): FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

Paper summary: FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

arxiv url: http://arxiv.org/abs/2604.24013v1
Date: Mon, 27 Apr 2026 03:48:21 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.727136
Title: FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training
Title (reference translation): FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training
Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed,
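Abstract summary: This work proposes a new communication-computation overlap technique that eliminates the tail latency of existing overlap methods. It introduces a method called Flash-Overlap that replaces the conventional reduce-scatter and all-gather collectives with decomposed peer-to-peer (P2P) communication. The method provides an exact algorithm that reduces communication overhead and eliminates tail latency.
Reference score (independently calculated attention score): 5.653799468368196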
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state-of-the-art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces the conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.
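Abstract (reference translation): As large language models grow rapidly in size, computational workloads must be partitioned across accelerators such as GPUs, TPUs, and NPUs. These parallelization strategies, however, incur substantial data communication overhead that significantly degrades computational efficiency. Communication-computation overlap is a promising direction, but existing slicing-based solutions suffer from tail latency. To overcome this limitation, we propose a new communication-computation overlap technique that eliminates this tail latency in state-of-the-art overlap methods for distributed LLM training. The technique aims to effectively mitigate the communication bottlenecks of tensor parallelism and data parallelism in distributed training and inference. In particular, we propose a method called Flash-Overlap that replaces the conventional collective operations reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication. The method provides an exact algorithm that reduces communication overhead and eliminates tail latency. It also offers a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experiments show that the method achieves lower latency, better Model FLOPS Utilization (MFU), and high throughput.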
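
The abstract's central idea, replacing a reduce-scatter collective with decomposed P2P transfers and interleaving partitioned computation with them, can be illustrated with a generic ring decomposition. The sketch below is not the Flash-Overlap algorithm, whose schedule is not described on this page. It is a single-process simulation, and the names `ring_reduce_scatter_p2p`, `reference_reduce_scatter`, and `compute_chunk` are hypothetical. It only shows that a chain of neighbour-to-neighbour sends, with each local partial computed in the round where it is first needed, yields the same result as the collective.

```python
def reference_reduce_scatter(num_ranks, compute_chunk):
    """Collective view: output chunk c is the sum of every rank's partial for c."""
    out = []
    for c in range(num_ranks):
        partials = [compute_chunk(r, c) for r in range(num_ranks)]
        out.append([sum(vals) for vals in zip(*partials)])
    return out


def ring_reduce_scatter_p2p(num_ranks, compute_chunk):
    """Simulate reduce-scatter as num_ranks - 1 rounds of neighbour P2P sends.

    compute_chunk(rank, chunk) is a hypothetical callback returning that rank's
    local partial result for `chunk` as a list of floats. Each partial is
    computed only in the round that first needs it; on real hardware that is
    the point where computation could run while the previous send is in flight.
    """
    n = num_ranks
    # Round 0 payload: rank r starts the chain for chunk (r - 1) % n.
    in_flight = [compute_chunk(r, (r - 1) % n) for r in range(n)]
    for step in range(n - 1):
        # P2P phase: every rank sends its running sum to its right neighbour.
        received = [None] * n
        for r in range(n):
            received[(r + 1) % n] = in_flight[r]
        # Compute phase: add the local partial for the chunk that just arrived.
        for r in range(n):
            c = (r - step - 2) % n
            local = compute_chunk(r, c)
            in_flight[r] = [a + b for a, b in zip(received[r], local)]
    # After the last round, rank r holds the fully reduced chunk r.
    return in_flight


if __name__ == "__main__":
    def partial(r, c):
        return [float(10 * r + c), float(r + c)]

    print(ring_reduce_scatter_p2p(4, partial))
    print(reference_reduce_scatter(4, partial))
```

In this simulation every (rank, chunk) partial is computed exactly once, so the decomposition adds no extra arithmetic. Whether it removes tail latency in practice depends on the scheduling details given in the paper itself.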

Related papers