Fugu-MT Paper Translation (Abstract): MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall

Paper summary: MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall

arxiv url: http://arxiv.org/abs/2509.02480v1
Date: Tue, 02 Sep 2025 16:30:49 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:04.104747
Title: MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
Title (reference translation): MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
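Abstract summary: This paper proposes MLP-Offload, a novel multi-level, multi-path offloading engine for optimizing LLM training on resource-constrained setups. The authors make several key observations that drive the design of MLP-Offload, such as the I/O overheads incurred during the update phase. They show that MLP-Offload achieves 2.5$\times$ faster iterations compared to state-of-the-art training runtimes.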
Reference score (independently computed attention score): 2.3041368958484596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes are growing faster than GPU memory. To this end, state-of-the-art approaches propose multi-tier host memory or disk offloading techniques. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies incur significant I/O overheads in the critical path of training, resulting in slower iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained setups by mitigating I/O bottlenecks. We make several key observations that drive the design of MLP-Offload: I/O overheads during the update dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks during the backward and update phases. Evaluations on models with up to 280B parameters show that MLP-Offload achieves 2.5$\times$ faster iterations compared to state-of-the-art LLM training runtimes.
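Abstract (reference translation): Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes are growing faster than GPU memory. To this end, state-of-the-art approaches propose multi-tier host memory and disk offloading techniques. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies incur significant I/O overheads in the critical path of training, which slows iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine aimed at optimizing LLM training on resource-constrained setups. We make several key observations that drive the design of MLP-Offload: I/O overheads during the update dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Based on these insights, we design and implement offloading of the optimizer states in a cache-efficient and concurrency-controlled fashion to mitigate I/O bottlenecks during the backward and update phases. Evaluations on models with up to 280B parameters show that MLP-Offload achieves 2.5$\times$ faster iterations.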
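
As a rough illustration of the memory wall the abstract refers to, the sketch below estimates how much training state a 280B-parameter model carries. It assumes mixed-precision Adam with the commonly used accounting of 2 bytes per parameter for half-precision weights, 2 bytes for gradients, and 12 bytes for full-precision optimizer state (master weights, momentum, variance). These byte counts, the 80 GB GPU capacity, and all function names are assumptions for illustration and are not taken from the paper; activations and temporary buffers are ignored.

```python
def training_state_bytes(num_params, bytes_params=2, bytes_grads=2, bytes_optimizer=12):
    """Return (weights_and_grads_bytes, optimizer_bytes) for a model.

    Defaults follow a common mixed-precision Adam accounting (an assumption,
    not a figure from the paper): fp16 weights, fp16 gradients, and fp32
    master weights + momentum + variance.
    """
    weights_and_grads = num_params * (bytes_params + bytes_grads)
    optimizer = num_params * bytes_optimizer
    return weights_and_grads, optimizer


def min_gpus_to_hold(total_bytes, gpu_mem_bytes):
    """Smallest number of GPUs whose combined memory covers total_bytes."""
    return -(-total_bytes // gpu_mem_bytes)  # ceiling division on integers


num_params = 280 * 10**9       # largest model size mentioned in the abstract
gpu_mem_bytes = 80 * 10**9     # hypothetical 80 GB accelerator

weights_and_grads, optimizer = training_state_bytes(num_params)
total = weights_and_grads + optimizer
gpus_needed = min_gpus_to_hold(total, gpu_mem_bytes)

print(f"weights + gradients: {weights_and_grads / 1e12:.2f} TB")
print(f"optimizer states:    {optimizer / 1e12:.2f} TB")
print(f"GPUs needed without offloading: {gpus_needed}")
```

Under these assumptions the optimizer states account for three quarters of the total training state, which is consistent with the abstract's focus on offloading optimizer states and on I/O during the update phase.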


Related papers