Fugu-MT Paper Translation (Abstract): EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

Paper overview: EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

arxiv url: http://arxiv.org/abs/2605.25395v1
Date: Mon, 25 May 2026 03:39:10 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.275362
Title: EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
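Title (reference translation): EMA-Nesterov: Stabilizing Nesterov's Lookahead for Deep Learning Optimization
Authors: Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong
Abstract summary: We propose EMA-Nesterov, a simple modification that replaces the standard Nesterov lookahead direction with an exponential moving average (EMA) of parameter updates. We present empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers.
Reference score (independently computed attention score): 29.89435961169451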
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov's acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolating noisy one-step updates. Leveraging this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov's lookahead direction with an exponential moving average (EMA) of parameter updates. This yields a stabilized lookahead direction that captures and harnesses the evolving trend of the training trajectory through a low-pass filter, while remaining adaptive to progressive changes via the geometric weighting structure of EMA. We show that EMA-Nesterov retains a theoretical accelerated convergence rate in convex problems that is analogous to Nesterov's accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
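Abstract (reference translation): Lookahead-based acceleration methods such as Nesterov's momentum are widely used in optimization, but they are often unreliable in deep learning training because of stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead depends on short-horizon update signals (for example, the difference between consecutive iterates), which are inherently noisy and can produce unstable extrapolation directions. This work revisits Nesterov's acceleration from the viewpoint of the trajectory and argues that effective acceleration in deep learning should exploit the low-frequency trend of the optimization trajectory instead of extrapolating noisy one-step updates. Building on this insight, EMA-Nesterov is a simple modification that replaces the standard Nesterov lookahead direction with an exponential moving average (EMA) of parameter updates. The result is a stabilized lookahead direction that captures and exploits the evolving trend of the training trajectory through a low-pass filter, while the geometric weighting structure of the EMA keeps it adaptive to gradual changes. We show that, on convex problems, EMA-Nesterov retains a theoretical accelerated convergence rate analogous to that of Nesterov's accelerated gradient method. We also provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across fine-tuned base optimizers such as Adam, SOAP, and Muon, as well as complex optimizers that achieve state-of-the-art performance on an optimization benchmark (NanoGPT). Compared with prior lookahead methods, EMA-Nesterov achieves better performance by avoiding both the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.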
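The core idea in the abstract, extrapolating along an EMA of parameter updates instead of the latest one-step difference, can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so everything below is an assumption made for illustration: the function name `ema_lookahead_descent`, the hyperparameters `lr`, `beta`, and `gamma`, the plain gradient step used as the base optimizer, and the deterministic toy quadratic. It is not the authors' algorithm and does not attempt to reproduce the accelerated rate claimed in the paper.

```python
def ema_lookahead_descent(grad, x0, lr=0.05, beta=0.9, gamma=0.5, steps=500):
    """Gradient descent evaluated at a lookahead point along an EMA of updates.

    Hypothetical sketch: the update rule and all parameter names are
    assumptions, not taken from the paper.
    """
    x = list(x0)
    ema = [0.0] * len(x)  # EMA of past parameter updates (smoothed trend)
    for _ in range(steps):
        # Lookahead point: extrapolate from x along the smoothed trend.
        y = [xi + gamma * mi for xi, mi in zip(x, ema)]
        g = grad(y)
        # Base step (plain gradient descent here) taken from the lookahead point.
        x_new = [yi - lr * gi for yi, gi in zip(y, g)]
        # Fold the newest parameter update into the EMA with geometric weights.
        ema = [beta * mi + (1.0 - beta) * (xn - xi)
               for mi, xn, xi in zip(ema, x_new, x)]
        x = x_new
    return x


# Toy convex problem: f(x) = 0.5 * (1 * x0^2 + 10 * x1^2), minimized at the origin.
def quad_grad(v):
    return [1.0 * v[0], 10.0 * v[1]]


if __name__ == "__main__":
    print(ema_lookahead_descent(quad_grad, [1.0, 1.0]))
```

Here `beta` controls how long a history the trend averages over, which is the low-pass filtering the abstract refers to, and `gamma` scales how far the lookahead point moves along that trend. Setting `beta` to 0 makes the trend equal to the most recent update, which corresponds to the short-horizon lookahead the abstract describes as noisy.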
