Fugu-MT paper translation (abstract): Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Paper abstract: Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

arxiv url: http://arxiv.org/abs/2407.08100v1
Date: Thu, 11 Jul 2024 00:10:35 GMT
Status: Translation complete
Last updated in system: 2024-07-12 19:18:18.626703
Title: Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates
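Title (reference translation): Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates
Authors: Steffen Dereich, Robin Graeber, Arnulf Jentzen
Abstract summary: Deep learning algorithms are key ingredients in many artificial intelligence (AI) systems. They typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
Reference score (attention score computed independently by this site): 3.6185342807265415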
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stopping problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several convex optimization problems if the learning rates are bounded away from zero. However, in many practically relevant training scenarios, often not the plain vanilla standard SGD method but instead adaptive SGD methods such as the RMSprop and the Adam optimizers, in which the learning rates are modified adaptively during the training process, are employed. This naturally raises the question whether such adaptive optimizers, in which the learning rates are modified adaptively during the training process, do converge in the situation of non-vanishing learning rates. In this work we answer this question negatively by proving that adaptive SGD methods such as the popular Adam optimizer fail to converge to any possible random limit point if the learning rates are asymptotically bounded away from zero. In our proof of this non-convergence result we establish suitable pathwise a priori bounds for a class of accelerated and adaptive SGD methods, which are also of independent interest.
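Abstract (reference translation): Deep learning algorithms, typically a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method, are today key ingredients in many artificial intelligence (AI) systems and have revolutionized the way we work and live in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as ChatGPT and Gemini, and they are also used to build successful generative AI based text-to-image models such as Midjourney, DALL-E, and Stable Diffusion. It is known that the plain vanilla standard SGD method fails to converge, even for several convex optimization problems, if the learning rates are bounded away from zero. In many practical training scenarios, however, adaptive SGD methods such as the RMSprop and Adam optimizers are used instead of the plain vanilla standard SGD method. This naturally raises the question of whether such adaptive optimizers, in which the learning rates are modified adaptively during training, converge when the learning rates do not vanish. This work answers the question negatively by proving that adaptive SGD methods such as the popular Adam optimizer fail to converge to any possible random limit point if the learning rates are asymptotically bounded away from zero. The proof of this non-convergence result establishes suitable pathwise a priori bounds for a class of accelerated and adaptive SGD methods, which are also of independent interest.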
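The non-convergence statement in the abstract can be illustrated numerically with a minimal sketch. It is not the construction used in the paper's proof. It assumes a toy objective E[(theta - X)^2] / 2 with X ~ N(0, 1), the common default Adam hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), and, for contrast only, a decaying schedule 0.1 / sqrt(n) that is an illustrative choice and does not come from the paper. The function name `run_adam` and all parameter values are hypothetical.

```python
import math

import numpy as np


def run_adam(learning_rate, n_steps=20_000, beta1=0.9, beta2=0.999,
             eps=1e-8, seed=0):
    """Run Adam on the toy objective E[(theta - X)^2] / 2 with X ~ N(0, 1).

    learning_rate is a function mapping the step index n (starting at 1)
    to the step size used at that step. Returns all iterates as an array.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_steps)
    theta, m, v = 1.0, 0.0, 0.0
    path = np.empty(n_steps)
    for n in range(1, n_steps + 1):
        grad = theta - noise[n - 1]  # stochastic gradient of (theta - x)^2 / 2
        m = beta1 * m + (1.0 - beta1) * grad      # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * grad**2   # second-moment estimate
        m_hat = m / (1.0 - beta1**n)              # bias corrections
        v_hat = v / (1.0 - beta2**n)
        theta -= learning_rate(n) * m_hat / (math.sqrt(v_hat) + eps)
        path[n - 1] = theta
    return path


# Learning rate bounded away from zero versus a learning rate that vanishes.
constant_path = run_adam(lambda n: 0.1)
decaying_path = run_adam(lambda n: 0.1 / math.sqrt(n))

# Spread of the last iterates: a rough proxy for "has the sequence settled?"
tail = 2_000
constant_spread = constant_path[-tail:].std()
decaying_spread = decaying_path[-tail:].std()
print(f"tail spread, constant learning rate: {constant_spread:.4f}")
print(f"tail spread, decaying learning rate: {decaying_spread:.4f}")
```

In this sketch the iterates produced with the constant learning rate keep fluctuating around the minimizer, while the fluctuations shrink under the decaying schedule. A single simulated path is only suggestive; the paper's result is a proof, not an experiment.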

Related papers

PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning [5.052293146674794]
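Averaging techniques such as Ruppert-Polyak averaging and exponential moving averages (EMAs) are powerful approaches for accelerating stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. This work proposes an averaging approach called parallel averaged ADAM (PADAM), which computes different averaged variants of ADAM in parallel and, during training, dynamically selects the variant with the smallest optimization error.
Paper reference translation (metadata) (2025-05-28T08:07:34Z)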
AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent [58.05410015124021]
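This paper introduces AutoSGD, an SGD method. Empirical results indicate strong performance of the method on traditional optimization problems and machine learning tasks.
Paper reference translation (metadata) (2025-05-27T18:25:21Z)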
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
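We study optimization methods that use a linear minimization oracle (LMO) over a norm ball. We propose a new family of algorithms that use the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
Paper reference translation (metadata) (2025-02-11T13:10:34Z)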
Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problems [5.052293146674794]
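This work is inspired by the classical Polyak-Ruppert averaging approach. It applies averaged variants of the Adam optimizer to the training of deep neural networks (DNNs). In each of the numerical examples, the averaged variants of Adam that are employed outperform standard Adam and standard SGD.
Paper reference translation (metadata) (2025-01-10T16:15:25Z)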
Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
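Explaining the successes and limitations of SGD methods in rigorous theoretical terms remains an open research problem. This work shows, for a large class of SGD methods, that with high probability they do not converge to global minimizers of the optimization problem. The general non-convergence results of this work apply not only to the plain vanilla standard SGD method but also to many accelerated and adaptive SGD methods.
Paper reference translation (metadata) (2024-10-14T14:11:37Z)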
Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
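It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as Adam, fail to converge if the learning rates do not converge to zero. This work proposes and studies a learning-rate-adaptive approach for SGD optimization methods that adjusts the learning rates based on empirical estimates.
Paper reference translation (metadata) (2024-06-20T14:07:39Z)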
Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
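Physics-informed neural networks (PINNs) have been shown to be effective for solving forward and inverse differential equation problems. However, PINNs can get stuck in training failures when the target function to be approximated exhibits high-frequency or multi-scale features. This paper proposes training PINNs with the implicit stochastic gradient descent (ISGD) method to improve the stability of the training process.
Paper reference translation (metadata) (2023-03-03T08:17:47Z)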
Dissecting adaptive methods in GANs [46.90376306847234]
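We study how adaptive methods help in training generative adversarial networks (GANs). Considering an update rule that combines the magnitude of the Adam update with the normalized direction of SGD, we show empirically that the adaptive magnitude of Adam is key to GAN training. In this setting, we prove that GANs trained with nSGDA recover all modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
Paper reference translation (metadata) (2022-10-09T19:00:07Z)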
Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
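Stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs). This work studies SGD type optimization methods in the training of fully connected feedforward DNNs with rectified linear unit (ReLU) activation.
Paper reference translation (metadata) (2021-12-13T11:45:36Z)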
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
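This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. It shows that SGD converges along the large eigenvalue directions of the data matrix, whereas GD converges along the small eigenvalue directions.
Paper reference translation (metadata) (2020-11-04T21:07:52Z)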
Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
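We propose an adaptive gradient method with resilience and momentum (AdaRem). AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's past changes is aligned with the direction of the current gradient. The method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
Paper reference translation (metadata) (2020-10-21T14:49:00Z)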
MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
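This paper proposes an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean. This enables faster adaptation and leads to more desirable empirical convergence behavior.
Paper reference translation (metadata) (2020-06-21T21:47:43Z)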
AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
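We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS). Experiments show that AdaS, using the derived metrics, exhibits (a) faster convergence and better generalization than existing adaptive learning methods and (b) no dependence on a validation set to determine when to stop training.
Paper reference translation (metadata) (2020-06-11T16:36:31Z)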
A Dynamic Sampling Adaptive-SGD Method for Machine Learning [8.173034693197351]
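This paper proposes a method that adaptively controls the batch size used to compute gradient approximations and the step size used to move along the resulting direction. The method uses local curvature information to ensure that the search direction is a descent direction with high probability. Numerical experiments show that the method can select the best learning rates and compares favorably with fine-tuned SGD for training logistic regression and DNNs.
Paper reference translation (metadata) (2019-12-31T15:36:44Z)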
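Two entries in the list above (PADAM and Averaged Adam) build on Ruppert-Polyak averaging of the optimizer iterates. The following is a minimal sketch of that general idea only; it is not the algorithm of either paper. It assumes the toy objective E[(theta - X)^2] / 2 with X ~ N(0, 1), a constant learning rate of 0.1, default Adam hyperparameters, and an average taken over the second half of the iterates. The function name `averaged_adam` and the `burn_in` parameter are hypothetical.

```python
import numpy as np


def averaged_adam(n_steps=5_000, n_runs=50, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, burn_in=2_500, seed=0):
    """Run n_runs independent Adam trajectories with a constant learning rate.

    Returns the last iterate and the arithmetic mean of the iterates after
    the burn-in phase (a Ruppert-Polyak style average) for every run.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_steps, n_runs))
    theta = np.ones(n_runs)
    m = np.zeros(n_runs)
    v = np.zeros(n_runs)
    running_sum = np.zeros(n_runs)
    for n in range(1, n_steps + 1):
        grad = theta - noise[n - 1]  # stochastic gradient of (theta - x)^2 / 2
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        m_hat = m / (1.0 - beta1**n)
        v_hat = v / (1.0 - beta2**n)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        if n > burn_in:
            running_sum += theta  # accumulate iterates for the average
    averaged = running_sum / (n_steps - burn_in)
    return theta, averaged


last, averaged = averaged_adam()

# Root mean square distance to the minimizer theta = 0 across all runs.
rms_last = np.sqrt(np.mean(last**2))
rms_averaged = np.sqrt(np.mean(averaged**2))
print(f"RMS error of last iterate:     {rms_last:.4f}")
print(f"RMS error of averaged iterate: {rms_averaged:.4f}")
```

In this sketch the individual iterates keep fluctuating because the learning rate is constant, while the averaged iterate smooths out much of that fluctuation and ends up closer to the minimizer.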

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

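This page shows information about the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).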