Fugu-MT Paper Translation (Abstract): Conda: Column-Normalized Adam for Training Large Language Models Faster

Paper overview: Conda: Column-Normalized Adam for Training Large Language Models Faster

arxiv url: http://arxiv.org/abs/2509.24218v1
Date: Mon, 29 Sep 2025 02:58:19 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.708583
Title: Conda: Column-Normalized Adam for Training Large Language Models Faster
Title (reference translation): Conda: Column-Normalized Adam for Training Large Language Models Faster
Authors: Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
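Abstract summary: Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. The paper proposes Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of Adam and Muon. Experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training.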
Reference score (independently computed attention score): 70.66067959375748
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structures, hindering efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second moment normalization based on the projected gradients, thereby improving spectral conditioning while maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, Conda achieves 2 to 2.5 times the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released at https://github.com/jie040109/Conda
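Abstract (reference translation): Large language models (LLMs) show remarkable generalization and emergent capabilities, but their pre-training is computationally expensive and sensitive to optimization dynamics. Adam-based optimizers provide fast convergence by adjusting learning rates coordinate-wise, yet recent studies show that their updates often suffer from poor spectral conditioning and low-rank structure, which hurts efficiency. Muon addresses this problem through global spectral normalization but lacks Adam's per-coordinate adaptivity. This work proposes Column-Normalized Adam (Conda), a new optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second moment normalization based on the projected gradients, achieving both improved spectral conditioning and coordinate-wise adaptivity. This design mitigates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Notably, on the LLaMA series, Conda achieves 2 to 2.5 times the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. Together, these results highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released at https://github.com/jie040109/Conda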
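
Illustrative sketch (not the authors' implementation): the abstract describes the update only at a high level, so the NumPy code below is one possible reading of "project into an orthogonal subspace, then normalize by a column-wise second moment of the projected gradients". The function names `init_state` and `conda_like_step`, the choice of the left singular vectors of the momentum as the orthogonal basis, the per-column mean of squared projected gradients, the absence of bias correction, and all hyperparameter defaults are assumptions made for illustration. The released repository is the authoritative source for the actual algorithm.

```python
import numpy as np


def init_state(W):
    """Optimizer state for one 2-D weight matrix W (hypothetical helper)."""
    return {
        "m": np.zeros_like(W),        # first moment, same shape as W
        "v": np.zeros(W.shape[1]),    # second moment, one scalar per column
    }


def conda_like_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One illustrative update of W given its gradient G.

    This follows the abstract's description loosely; every design choice
    marked "assumption" below is ours, not taken from the paper.
    """
    # Adam-style exponential moving average of the gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * G

    # Assumption: use the left singular vectors of the momentum as the
    # orthonormal basis of the projection subspace.
    U, _, _ = np.linalg.svd(state["m"], full_matrices=False)

    # Project both the momentum and the raw gradient into that subspace.
    m_proj = U.T @ state["m"]
    g_proj = U.T @ G

    # Assumption: the column-wise second moment is the running mean of the
    # squared projected gradient, averaged over the rows of each column.
    col_sq = np.mean(g_proj ** 2, axis=0)
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * col_sq

    # Scale every column of the projected momentum by its own statistic
    # (broadcasting over the last axis), then map back to weight space.
    update = m_proj / (np.sqrt(state["v"]) + eps)
    return W - lr * (U @ update)
```

A call such as `W = conda_like_step(W, G, state)` inside a training loop would apply the step to each 2-D parameter separately; how the actual method treats 1-D parameters, weight decay, and bias correction is not stated in the abstract and is left out here.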

Related papers