Fugu-MT Paper Translation (Summary): Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

Paper summary: Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

arxiv url: http://arxiv.org/abs/1908.04207v4
Date: Tue, 19 Aug 2025 09:19:33 GMT
Status: Translation complete
System update date: 2025-08-20 15:36:31.238565
Title: Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Title (reference translation): Taming unbalanced training workloads in deep learning with partial collective operations
Authors: Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
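Abstract summary: This paper proposes eager-SGD, which relaxes the global synchronization for decentralized accumulation. It shows that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD without losing accuracy.
Reference score (independently computed attention score): 49.26578529891149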
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
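Abstract (reference translation): Load imbalance is widespread in distributed deep learning training systems, caused either by the inherent imbalance of the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy on a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. This paper proposes eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, it proposes two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower ones, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without a central parameter server. The convergence of the algorithms is proven theoretically and the partial collectives are described in detail. Experimental results in load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD without losing accuracy.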
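Illustrative sketch: The difference between the solo and majority collectives described in the abstract can be shown with a small single-process simulation. This is not the authors' implementation. The function names, the use of per-worker arrival times, the choice to let late workers contribute zeros to the current step, and the division by the total number of workers are all assumptions made for illustration; the abstract does not specify how late gradients are handled.

```python
import math


def partial_allreduce(grads, arrival_times, min_contributors):
    """Simulate a partial allreduce over per-worker gradient vectors.

    The collective fires as soon as `min_contributors` workers have arrived.
    Workers that arrive later contribute zeros to this step (an assumption
    of this sketch, not a detail taken from the abstract).
    Returns (averaged_gradient, contributor_ids, late_ids).
    """
    num_workers = len(grads)
    if not 1 <= min_contributors <= num_workers:
        raise ValueError("min_contributors must be between 1 and the worker count")

    # Time at which the min_contributors-th fastest worker becomes ready.
    fire_time = sorted(arrival_times)[min_contributors - 1]
    contributors = [w for w in range(num_workers) if arrival_times[w] <= fire_time]
    late = [w for w in range(num_workers) if arrival_times[w] > fire_time]

    dim = len(grads[0])
    total = [0.0] * dim
    for w in contributors:
        for i in range(dim):
            total[i] += grads[w][i]

    # Divide by the full worker count, so late workers count as zeros (assumption).
    averaged = [t / num_workers for t in total]
    return averaged, contributors, late


def solo_allreduce(grads, arrival_times):
    """Fires when the single fastest worker is ready."""
    return partial_allreduce(grads, arrival_times, 1)


def majority_allreduce(grads, arrival_times):
    """Fires once at least half of the workers are ready."""
    return partial_allreduce(grads, arrival_times, math.ceil(len(grads) / 2))


if __name__ == "__main__":
    # Hypothetical gradients and arrival times for four workers.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    arrival = [0.1, 0.4, 0.2, 0.9]
    print("solo:    ", solo_allreduce(grads, arrival))
    print("majority:", majority_allreduce(grads, arrival))
    print("sync:    ", partial_allreduce(grads, arrival, len(grads)))
```

Setting `min_contributors` to the worker count reproduces the fully synchronous case, which makes the trade-off visible: fewer required contributors means the step fires earlier but averages over fewer real gradients.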

Related papers