Fugu-MT paper translation (abstract): CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

Paper summary: CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

arxiv url: http://arxiv.org/abs/2508.19073v1
Date: Tue, 26 Aug 2025 14:29:34 GMT
Status: translation complete
Last updated in system: 2025-08-27 17:42:38.882959
Title: CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator
Authors: Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün
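Abstract summary: GPUs are the core computational resource for deep learning (DL) training. Collocating DL tasks on a GPU may result in out-of-memory crashes for the subsequently arriving task and in slowdowns for all tasks sharing the GPU due to resource interference. We propose CARMA, a server-scale task-level collocation-aware resource management system.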
Reference score (independently computed attention score): 5.998463702026698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Studies conducted on enterprise-scale infrastructure have shown that GPUs, the core computational resource for deep learning (DL) training, are often significantly underutilized. DL task collocation on GPUs is an opportunity to address this challenge. However, it may result in (1) out-of-memory crashes for the subsequently arriving task and (2) slowdowns for all tasks sharing the GPU due to resource interference. The former challenge poses a threat to robustness, while the latter affects the quality of service and energy efficiency. We propose CARMA, a server-scale task-level collocation-aware resource management system that handles both collocation challenges. CARMA encompasses GPUMemNet, a novel ML-based GPU memory estimator framework for DL training tasks, to minimize out-of-memory errors and introduces collocation policies that cap GPU utilization to minimize interference. Furthermore, CARMA introduces a recovery method to ensure robust restart of tasks that crash. Our evaluation on traces modeled after real-world DL training task traces shows that CARMA increases the GPU utilization over time by 39.3%, decreases the end-to-end execution time by ~26.7%, and reduces the GPU energy use by ~14.2%.
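The two checks the abstract describes (an estimated memory footprint to avoid out-of-memory crashes, and a cap on GPU utilization to limit interference) can be illustrated with a minimal sketch. This is not CARMA's implementation: `GpuState`, `pick_gpu`, the 80% utilization cap, and the 1024 MiB safety margin are all hypothetical names and values chosen for illustration, and the memory estimate is assumed to come from some estimator such as GPUMemNet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Snapshot of one GPU; fields are illustrative, not CARMA's data model."""
    gpu_id: int
    total_mem_mib: int
    used_mem_mib: int
    utilization_pct: float  # recent average compute utilization, 0-100


def pick_gpu(
    gpus: List[GpuState],
    est_task_mem_mib: int,
    util_cap_pct: float = 80.0,  # assumed cap; the abstract gives no value
    mem_margin_mib: int = 1024,  # assumed safety margin for estimator error
) -> Optional[int]:
    """Return the id of a GPU that can host the task, or None to keep it queued."""
    candidates = []
    for gpu in gpus:
        free_mem = gpu.total_mem_mib - gpu.used_mem_mib
        # Memory check: estimated footprint plus margin must fit in free memory.
        fits = est_task_mem_mib + mem_margin_mib <= free_mem
        # Interference check: only collocate on GPUs below the utilization cap.
        below_cap = gpu.utilization_pct < util_cap_pct
        if fits and below_cap:
            candidates.append(gpu)
    if not candidates:
        return None
    # Among eligible GPUs, prefer the least utilized one.
    return min(candidates, key=lambda g: g.utilization_pct).gpu_id
```

Returning `None` stands in for leaving the task in a queue until a GPU becomes eligible; how CARMA actually queues tasks, or recovers tasks that still crash, is not specified in the abstract.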

Related papers