Fugu-MT Paper Translation (Abstract): On Harnessing Idle Compute at the Edge for Foundation Model Training

Paper overview: On Harnessing Idle Compute at the Edge for Foundation Model Training

arxiv url: http://arxiv.org/abs/2512.22142v1
Date: Sat, 13 Dec 2025 20:57:43 GMT
Status: Translation complete
Last updated in system: 2026-01-04 08:45:17.063942
Title: On Harnessing Idle Compute at the Edge for Foundation Model Training
Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina
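Abstract summary: We introduce Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.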
Reference score (independently calculated attention score): 7.228241542082645
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The ecosystem behind foundation model development today is highly centralized and limited to large-scale cloud data center operators: training foundation models is costly, needing immense compute resources. Decentralized foundation model training across edge devices, leveraging their spare compute, promises a democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-based training performance, exhibit limited scalability with model size, exceed device memory capacity, and have prohibitive communication overhead. They also fail to satisfactorily handle device heterogeneity and dynamism. We introduce a new paradigm, Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Together with a parameter server centric training framework, Cleave copes with device memory limits and avoids communication bottlenecks, thereby enabling efficient training of large models on par with the cloud. Further, with a cost optimization model to guide device selection and training workload distribution, Cleave effectively accounts for device heterogeneity and churn. Our evaluations show that Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.

Related papers