Fugu-MT Paper Translation (Abstract): Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Paper summary: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arxiv url: http://arxiv.org/abs/2512.17131v1
Date: Thu, 18 Dec 2025 23:59:03 GMT
Status: translation complete
Last updated in system: 2025-12-22 19:25:54.203585
Title: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Title (reference translation): Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Authors: Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
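Abstract summary: Generalized Primal Averaging (GPA) is an extension of Nesterov's method in its primal averaging formulation. GPA overcomes the limitations of single-worker DiLoCo by decoupling the interpolation constant in that formulation. On the Llama-160M model, GPA provides a 24.22% speedup in the number of steps needed to reach the baseline (AdamW) validation loss.
Reference score (independently computed attention score): 23.139573772811513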
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
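Abstract (reference translation): This paper proposes Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation. Single-worker DiLoCo and Schedule-Free (SF), two recent algorithmic approaches, improve the performance of base optimizers such as AdamW through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, whereas single-worker DiLoCo averages implicitly by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, the periodic averaging of single-worker DiLoCo introduces a two-loop structure, which increases its memory requirements and its number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in Nesterov's primal averaging formulation. This decoupling lets GPA average iterates smoothly at every step, generalizing and improving on single-worker DiLoCo. GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing the memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in the number of steps needed to reach the baseline (AdamW) validation loss. Likewise, GPA achieves speedups of 12% and 27% on the small-batch and large-batch setups, respectively, in reaching AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, the paper proves that for any base optimizer whose regret is bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.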
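
The abstract describes averaging the iterates at every step with decoupled interpolation constants, but it does not state the update rule. The following is only a minimal sketch of one way such a scheme could look, assuming the three-sequence primal averaging form (base iterate z, average x, gradient point y) and plain SGD as the base optimizer instead of AdamW. The function name and the constants mu_x and mu_y are hypothetical and are not taken from the paper.

```python
def primal_averaging_sgd(grad, z0, lr, mu_x, mu_y, steps):
    """Three-sequence primal averaging with SGD as the base optimizer.

    z: base-optimizer iterate
    x: smoothed average of the iterates (the one extra buffer)
    y: point at which the gradient is evaluated
    mu_x, mu_y: hypothetical names for the two decoupled interpolation constants
    """
    z = list(z0)
    x = list(z0)
    for _ in range(steps):
        # Evaluate the gradient at an interpolation between average and iterate.
        y = [mu_y * xi + (1.0 - mu_y) * zi for xi, zi in zip(x, z)]
        g = grad(y)
        # Base optimizer step (plain SGD here, an assumption for brevity).
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Update the running average at every step, with no inner/outer loop.
        x = [mu_x * xi + (1.0 - mu_x) * zi for xi, zi in zip(x, z)]
    return x, z


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * (1 * w[0]**2 + 10 * w[1]**2).
    return [1.0 * w[0], 10.0 * w[1]]


x_avg, z_last = primal_averaging_sgd(
    quadratic_grad, z0=[1.0, 1.0], lr=0.05, mu_x=0.9, mu_y=0.5, steps=500
)
print(x_avg, z_last)
```

In this sketch, setting mu_x to 0 makes the average x equal to the base iterate z, so the scheme falls back to the base optimizer. Using separate values for mu_x and mu_y is how the sketch mimics the decoupling that the abstract mentions.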

Related papers