Fugu-MT Paper Translation (Abstract): SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

Paper summary: SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

arxiv url: http://arxiv.org/abs/2501.06842v1
Date: Sun, 12 Jan 2025 15:21:22 GMT
Status: Translation complete
System update date: 2025-01-14 17:20:21.28056
Title: SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
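Abstract summary: Large Language Models (LLMs) show exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to training instability. This paper comprehensively investigates the gradient spikes observed during LLM training and shows how prevalent they are across multiple architectures and datasets. It proposes Spike-Aware Adam with Momentum Reset (SPAM), which counteracts gradient spikes through momentum reset and spike-aware gradient clipping.
Reference score (independently computed attention score): 60.9776082805359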
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability is gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
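The abstract names two mechanisms, momentum reset and spike-aware gradient clipping, without giving their details. The following is a minimal sketch of how an Adam-style update might combine them; it is not the authors' implementation (see the linked repository for that). The spike test (`g*g > threshold * v_hat`), the `threshold` and `reset_interval` values, the restart of bias correction after a reset, and all function names are assumptions made for illustration.

```python
import math

def clip_spike(g, v_hat, threshold):
    """Assumed spike test: treat g as a spike when g^2 > threshold * v_hat,
    and shrink it to the boundary value while keeping its sign."""
    if v_hat > 0.0 and g * g > threshold * v_hat:
        return math.copysign(math.sqrt(threshold * v_hat), g)
    return g

def new_state(n):
    """Optimizer state for n scalar parameters (hypothetical layout)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "step": 0, "since_reset": 0}

def spam_like_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, reset_interval=500, threshold=50.0):
    """One Adam-style update with periodic momentum reset and
    spike-aware clipping. All hyperparameter defaults are illustrative."""
    n = len(params)
    # Momentum reset: every reset_interval steps, clear both moment
    # estimates so that past spikes stop influencing future updates.
    if state["step"] > 0 and state["step"] % reset_interval == 0:
        state["m"], state["v"] = [0.0] * n, [0.0] * n
        state["since_reset"] = 0
    t_prev = state["since_reset"]
    t = t_prev + 1
    out = []
    for i in range(n):
        v_prev = state["v"][i]
        # Bias-corrected second moment from earlier steps; zero right
        # after a reset, in which case no clipping is applied.
        v_hat_prev = v_prev / (1.0 - beta2 ** t_prev) if t_prev > 0 else 0.0
        g = clip_spike(grads[i], v_hat_prev, threshold)
        m = beta1 * state["m"][i] + (1.0 - beta1) * g
        v = beta2 * v_prev + (1.0 - beta2) * g * g
        state["m"][i], state["v"][i] = m, v
        # Bias correction counts steps since the last reset (assumption).
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        out.append(params[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
    state["step"] += 1
    state["since_reset"] = t
    return out

# Usage: ten ordinary gradients, then one gradient 1000x larger.
state = new_state(1)
params = [0.0]
for _ in range(10):
    params = spam_like_step(params, [1.0], state, reset_interval=100)
params = spam_like_step(params, [1000.0], state, reset_interval=100)
```

In this sketch the spike is clipped before it enters either moment estimate, so a single outlier cannot inflate the second moment and suppress the updates that follow it.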
