Fugu-MT Paper Translation (Abstract): StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

Paper abstract: StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

arxiv url: http://arxiv.org/abs/2604.15416v1
Date: Thu, 16 Apr 2026 17:55:36 GMT
Status: Translation complete
Last updated in system: 2026-04-20 22:00:19.598565
Title: StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
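Title (reference translation): StoSignSGD: Fixing SignSGD for Training Large Language Models
Authors: Dingzhi Yu, Rui Pan, Yuxing Liu, Tong Zhang
Abstract summary: Sign-based optimization algorithms such as SignSGD have attracted significant attention for their remarkable performance in distributed learning and in training large foundation models. Despite its empirical advantages, SignSGD is known to diverge on non-smooth objectives. The authors propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step.
Reference score (independently computed attention metric): 16.690425653502256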
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose \textbf{StoSignSGD}, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass prior definitions, proving that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining -- a setting where AdamW fails catastrophically -- StoSignSGD remains highly stable and yields a remarkable 1.44$\times$ to 2.14$\times$ speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Utilizing this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study to empirically validate our algorithmic design choices.
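Abstract (reference translation): Sign-based optimization algorithms such as SignSGD have attracted significant attention for their remarkable performance in distributed learning and in training large foundation models. Despite its empirical advantages, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning because of ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose \textbf{StoSignSGD}, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In (online) convex optimization, StoSignSGD rigorously resolves the non-convergence issues of SignSGD and achieves a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth setting, we introduce generalized stationary measures that encompass prior definitions and prove that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD shows robust stability and superior efficiency across diverse large language model (LLM) training regimes. In particular, in low-precision FP8 pretraining, where AdamW fails catastrophically, StoSignSGD remains highly stable and yields a 1.44$\times$ to 2.14$\times$ speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to clarify the mechanisms behind its success, we develop a sign conversion framework that can transform any general optimizer into its unbiased, sign-based counterpart. Using this framework, we decompose the core components of StoSignSGD and present a comprehensive ablation study that empirically validates our algorithmic design choices.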

Related papers