Fugu-MT Paper Translation (Abstract): Fantastic Pretraining Optimizers and Where to Find Them

Paper abstract: Fantastic Pretraining Optimizers and Where to Find Them

arxiv url: http://arxiv.org/abs/2509.02046v1
Date: Tue, 02 Sep 2025 07:43:22 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.947614
Title: Fantastic Pretraining Optimizers and Where to Find Them
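Title (reference translation): Fantastic Pretraining Optimizers and How to Find Them
Authors: Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang
Abstract summary: AdamW has long been the dominant optimizer in language model pretraining. The speedup of matrix-based optimizers is inversely proportional to model scale.
Reference score (independently computed attention score): 59.56075036649332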
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Third, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers, such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
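Abstract (reference translation): Despite numerous claims that alternative optimizers offer a 1.4x to 2x speedup, AdamW has long been the dominant optimizer in language model pretraining. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluation across a range of model scales and data-to-model ratios, performed at the end of training. First, hyperparameters that are optimal for one optimizer may be suboptimal for another, which makes blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and shrinks with model size, to only 1.1x for 1.2B-parameter models. Third, comparing intermediate checkpoints before the target training budget is reached can be misleading, because the ranking of two optimizers can flip during training due to learning rate decay. Through a thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners, multiplying gradients by matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B-parameter models to merely 1.1x for 1.2B-parameter models.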
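
The abstract distinguishes preconditioners that multiply gradients by matrices from those that use entry-wise scalars. The following is a minimal sketch of that distinction only. The function names, the `scale` values, and the `left` matrix are hypothetical and chosen for illustration; this is not the update rule of AdamW, Muon, Soap, or any other optimizer studied in the paper.

```python
# Illustrative sketch only: toy preconditioners, not the update rule of
# AdamW, Muon, Soap, or any optimizer evaluated in the paper.


def entrywise_precondition(grad, scale):
    """Multiply every gradient entry by its own scalar (entry-wise)."""
    return [[g * s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(grad, scale)]


def matrix_precondition(left, grad):
    """Multiply the gradient matrix on the left by a preconditioner matrix."""
    cols = list(zip(*grad))  # columns of the gradient matrix
    return [[sum(l * g for l, g in zip(l_row, col)) for col in cols]
            for l_row in left]


grad = [[1.0, 2.0], [3.0, 4.0]]
scale = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical per-entry scalars
left = [[1.0, -1.0], [0.0, 1.0]]  # hypothetical preconditioner matrix

scaled = entrywise_precondition(grad, scale)
mixed = matrix_precondition(left, grad)
print(scaled)
print(mixed)
```

In the entry-wise case each output entry depends only on the gradient entry at the same position. In the matrix case each output entry combines gradient entries from several rows, which is the structural difference the abstract points to.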

Related papers