Fugu-MT Paper Translation (Abstract): Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

Paper summary: Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

arxiv url: http://arxiv.org/abs/2605.16184v1
Date: Fri, 15 May 2026 17:03:55 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.382893
Title: Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Authors: Yishun Lu, Junhao Zhang, Zeyu Yang, Wes Armour,
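Abstract summary: We introduce Asteria, a runtime system that separates second-order optimization logic from the critical GPU training path. Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints. We evaluate Asteria in both memory-constrained and distributed training settings.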
Reference score (independently computed attention score): 4.950833328317384
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128GB unified memory, Asteria supports second-order training for a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
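Illustrative sketch (not from the paper): the abstract describes preparing shadow states so that inverse-root computations run asynchronously on the host, with staleness kept within a bound. The Python below is a minimal, single-process illustration of that general idea, not Asteria's implementation. The class `ShadowPreconditioner`, its `max_staleness` parameter, and the function `inverse_root` are hypothetical names, and a diagonal (element-wise) inverse root stands in for the matrix inverse root that SOAP or Shampoo-style optimizers would compute.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple


def inverse_root(stats: List[float], power: int = 4, eps: float = 1e-12) -> List[float]:
    # Diagonal stand-in (assumption): a real Shampoo-style optimizer would
    # compute a matrix inverse p-th root here, which is far more expensive.
    return [(s + eps) ** (-1.0 / power) for s in stats]


class ShadowPreconditioner:
    """Hypothetical helper that refreshes a preconditioner off the training path."""

    def __init__(self, dim: int, max_staleness: int) -> None:
        self.stats = [0.0] * dim          # accumulated squared gradients
        self.active = [1.0] * dim         # preconditioner currently applied
        self.active_step = 0              # step whose statistics produced `active`
        self.max_staleness = max_staleness
        self._pool = ThreadPoolExecutor(max_workers=1)  # stands in for host workers
        self._pending: Optional[Tuple[int, Future]] = None

    def _adopt(self, block: bool) -> None:
        # Swap in the shadow result if it is ready, or wait for it when forced.
        if self._pending is None:
            return
        launched_at, future = self._pending
        if block or future.done():
            self.active = future.result()
            self.active_step = launched_at
            self._pending = None

    def step(self, step: int, grad: List[float]) -> List[float]:
        """Precondition `grad`; steps are assumed to be consecutive integers from 1."""
        self.stats = [s + g * g for s, g in zip(self.stats, grad)]
        self._adopt(block=False)
        if self._pending is None:
            # Snapshot the statistics so the background job sees a stable copy.
            snapshot = list(self.stats)
            self._pending = (step, self._pool.submit(inverse_root, snapshot))
        if step - self.active_step > self.max_staleness:
            # Staleness bound exceeded: only now does the training path block.
            self._adopt(block=True)
        return [a * g for a, g in zip(self.active, grad)]

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

In this sketch, `max_staleness=0` forces a blocking refresh on every step, while larger values let the training loop keep using an older preconditioner until the background result arrives or the bound is hit. Tiered placement across GPU, CPU, and NVMe and multi-node coordination are outside the scope of this example.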

Related papers