Fugu-MT Paper Translation (Abstract): Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

Paper abstract: Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

arxiv url: http://arxiv.org/abs/2603.18112v1
Date: Wed, 18 Mar 2026 13:56:32 GMT
Status: Translation complete
System update date: 2026-03-20 17:19:05.774647
Title: Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training
Authors: Sahil Tyagi, Feiyi Wang
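Abstract summary: Tula is an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models. Tula predicts training time and cost within 7.5-14% error across multiple models and achieves up to 20x overall speedup.
Reference score (independently computed attention metric): 2.19670601855638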
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Distributed training increases the number of batches processed per iteration either by scaling out (adding more nodes) or scaling up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the time/cost versus batch-size Pareto curve. The optimal batch-size therefore depends on the underlying model, data, and available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models. It combines parallel-systems modeling with statistical performance prediction to identify the optimal batch-size. Tula predicts training time and cost within 7.5-14% error across multiple models, achieves up to 20x overall speedup, and improves test accuracy by 9% on average over standard large-batch training on various vision tasks, thus mitigating the generalization gap while accelerating training.
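The knee-point idea in the abstract can be illustrated with a small sketch. This is not Tula's method, which the abstract does not detail. It assumes a hypothetical cost model in which each iteration pays a fixed overhead plus a per-sample compute cost, and it locates the knee as the point farthest from the chord joining the two ends of the normalized curve. The function name `knee_index`, the function `epoch_time`, and all constants are assumptions made for illustration only.

```python
import math


def knee_index(xs, ys):
    """Return the index of the point farthest from the chord joining
    the first and last points, after scaling both axes to [0, 1]."""
    x_span = xs[-1] - xs[0]
    y_min, y_max = min(ys), max(ys)
    if x_span == 0 or y_max == y_min:
        raise ValueError("curve is degenerate; no knee can be defined")
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - y_min) / (y_max - y_min) for y in ys]
    # Direction of the chord from the first to the last normalized point.
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    norm = math.hypot(dx, dy)
    # Perpendicular distance of every point from that chord.
    dists = [abs(dy * (px - nx[0]) - dx * (py - ny[0])) / norm
             for px, py in zip(nx, ny)]
    return max(range(len(xs)), key=dists.__getitem__)


# Hypothetical cost model (an assumption, not taken from the paper):
# every iteration pays a fixed overhead plus a per-sample compute cost.
DATASET_SIZE = 50_000        # samples per epoch
ITER_OVERHEAD_S = 0.05       # fixed seconds per iteration
PER_SAMPLE_S = 0.0005        # compute seconds per sample


def epoch_time(batch_size):
    iterations = DATASET_SIZE / batch_size
    return iterations * (ITER_OVERHEAD_S + PER_SAMPLE_S * batch_size)


batch_sizes = [2 ** k for k in range(5, 13)]       # 32 ... 4096
times = [epoch_time(b) for b in batch_sizes]
# Batch sizes double at each step, so use log2 as the x-axis.
knee = knee_index([math.log2(b) for b in batch_sizes], times)
print(f"knee at batch size {batch_sizes[knee]}, "
      f"epoch time {times[knee]:.2f} s")
```

Under this toy model the epoch time falls steeply at small batch sizes and then flattens toward the pure compute cost, so batch sizes past the knee buy little additional speedup. A real system would also have to account for communication cost, memory limits, and the accuracy effects the abstract mentions.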
