Fugu-MT Paper Translation (Abstract): CARMA: Collocation-Aware Resource Manager

Paper overview: CARMA: Collocation-Aware Resource Manager

arxiv url: http://arxiv.org/abs/2508.19073v2
Date: Sat, 01 Nov 2025 16:13:11 GMT
Status: Translation complete
Last updated in system: 2025-11-04 18:19:02.778061
Title: CARMA: Collocation-Aware Resource Manager
Title (reference translation): CARMA: Collocation-Aware Resource Manager
Authors: Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün
Abstract summary: Collocating multiple deep learning (DL) training tasks on the same GPU improves utilization but introduces two major risks. This paper presents CARMA, a task-level, collocation-aware resource management system for the server scale.
Reference score (independently computed attention score): 5.998463702026698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a $\sim$35% and $\sim$15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.
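Abstract (reference translation): GPUs that run deep learning (DL) workloads are often underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization, but it introduces two major risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can cancel out any throughput gains. These problems reduce system robustness, quality of service, and energy efficiency. The paper presents CARMA, a task-level, collocation-aware resource management system for the server scale. CARMA addresses the challenges of collocation through (1) fine-grained monitoring and bookkeeping of GPUs, together with a collocation risk analysis that filters out high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs that crashed due to OOMs. An evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making better-informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. For this workload, this yields reductions of $\sim$35% in end-to-end execution time (makespan) and $\sim$15% in GPU energy consumption.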
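
The abstract describes placement as filtering out risky GPUs, capping utilization, and checking an estimated memory need before collocating a task. The following is a minimal sketch of that idea only, not CARMA's actual implementation: the names `GpuState` and `pick_gpu`, the 0.8 utilization cap, the 1024 MiB safety margin, and the least-utilized tie-breaking rule are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Hypothetical bookkeeping record for one GPU (not from the paper)."""
    gpu_id: int
    sm_util: float       # recent SM utilization as a fraction in [0.0, 1.0]
    mem_used_mib: int
    mem_total_mib: int

    @property
    def mem_free_mib(self) -> int:
        return self.mem_total_mib - self.mem_used_mib


def pick_gpu(
    gpus: List[GpuState],
    est_mem_mib: int,
    util_cap: float = 0.8,        # assumed cap; the abstract gives no value
    mem_margin_mib: int = 1024,   # assumed safety margin against estimator error
) -> Optional[GpuState]:
    """Return a GPU for a new task, or None if every GPU is too risky.

    A GPU is a candidate only if its utilization is at or below the cap
    (to limit interference) and its free memory covers the task's
    estimated need plus a margin (to reduce the chance of an OOM crash).
    """
    candidates = [
        g for g in gpus
        if g.sm_util <= util_cap
        and g.mem_free_mib >= est_mem_mib + mem_margin_mib
    ]
    if not candidates:
        return None  # caller would keep the task queued
    # Prefer the least-utilized GPU; break ties by the most free memory.
    return min(candidates, key=lambda g: (g.sm_util, -g.mem_free_mib))


if __name__ == "__main__":
    gpus = [
        GpuState(0, 0.90, 2000, 16000),   # above the utilization cap
        GpuState(1, 0.50, 12000, 16000),  # not enough free memory
        GpuState(2, 0.60, 4000, 16000),   # passes both checks
    ]
    chosen = pick_gpu(gpus, est_mem_mib=6000)
    print(chosen.gpu_id if chosen else "no safe GPU; task stays queued")
```

In this sketch a task that finds no safe GPU simply stays queued. A recovery step of the kind the abstract mentions would requeue a task that still crashed with an OOM, for example because the memory estimate was too low.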
