Fugu-MT Paper Translation (Abstract): Extending AdamW by Leveraging Its Second Moment and Magnitude

Paper abstract: Extending AdamW by Leveraging Its Second Moment and Magnitude

arxiv url: http://arxiv.org/abs/2112.06125v1
Date: Thu, 9 Dec 2021 12:20:07 GMT
Status: Translation complete
Last updated in system: 2021-12-14 15:48:29.819723
Title: Extending AdamW by Leveraging Its Second Moment and Magnitude
Title (reference translation): Extending AdamW by Leveraging the Second Moment and Magnitude
Authors: Guoqiang Zhang and Niwa Kenta and W. Bastiaan Kleijn
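Abstract summary: This paper proposes an adaptive optimisation method that extends AdamW in two aspects, with the aim of relaxing the small-learning-rate requirement for local stability. Aida is designed to compute the qth power of the magnitude in the form |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p). Experiments were conducted on solving ten toy optimisation problems and on training Transformer and Swin-Transformer for two deep learning (DL) tasks.
Reference score (independently computed attention score): 33.26668885327036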
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution for a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution. The above convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, which we refer to as Aida, by extending AdamW in two aspects with the aim of relaxing the requirement of a small learning rate for local stability. Firstly, we consider tracking the 2nd moment r_t of the pth power of the gradient magnitudes; r_t reduces to v_t of AdamW when p=2. Suppose {m_t} is the first moment of AdamW. It is known that the update direction m_{t+1}/(v_{t+1}+epsilon)^0.5 (or m_{t+1}/(v_{t+1}^0.5+epsilon)) of AdamW (or Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of magnitudes |m_{t+1}|/(v_{t+1}+epsilon)^0.5 (or |m_{t+1}|/(v_{t+1}^0.5+epsilon)). Aida is designed to compute the qth power of the magnitude in the form |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p) (or |m_{t+1}|^q/((r_{t+1})^(q/p)+epsilon)), which reduces to that of AdamW when (p,q)=(2,1). Suppose the origin 0 is a local optimal solution of a twice-differentiable function. It is found theoretically that when q>1 and p>1 in Aida, the origin 0 is locally stable only when the weight decay is non-zero. Experiments are conducted on solving ten toy optimisation problems and on training Transformer and Swin-Transformer for two deep learning (DL) tasks. The empirical study demonstrates that in a number of scenarios (including the two DL tasks), Aida with particular setups of (p,q) not equal to (2,1) outperforms the setup (p,q)=(2,1) of AdamW.
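Abstract (reference translation): Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution of a twice-differentiable function. To ensure local stability of the optimal solution, the learning rate must be sufficiently small. These convergence results also hold for AdamW. In this work, we propose an adaptive optimisation method called Aida, which extends AdamW in two aspects with the aim of relaxing the small-learning-rate requirement for local stability. First, we consider tracking the second moment r_t of the pth power of the gradient magnitudes; r_t reduces to v_t of AdamW when p=2. Let {m_t} be the first moment of AdamW. It is known that the update direction m_{t+1}/(v_{t+1}+epsilon)^0.5 (or m_{t+1}/(v_{t+1}^0.5+epsilon)) of AdamW (or Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of magnitudes |m_{t+1}|/(v_{t+1}+epsilon)^0.5 (or |m_{t+1}|/(v_{t+1}^0.5+epsilon)). Aida is designed to compute the qth power of the magnitude in the form |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p) (or |m_{t+1}|^q/((r_{t+1})^(q/p)+epsilon)). Let the origin 0 be a local optimal solution of a twice-differentiable function. Theoretically, when q>1 and p>1 in Aida, the origin 0 is locally stable only when the weight decay is non-zero. Experiments were conducted on solving ten toy optimisation problems and on training Transformer and Swin-Transformer for two deep learning (DL) tasks. The empirical study shows that in a number of scenarios (including the two DL tasks), Aida with particular setups of (p,q) not equal to (2,1) outperforms the AdamW setup (p,q)=(2,1).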
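
The update described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' reference implementation. The abstract gives the form of the update magnitude, |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p), and says that r_t tracks the pth power of the gradient magnitudes. Everything else here is an assumption: the function name `aida_step`, the exponential-moving-average recursions for m and r, the omission of bias correction, the default hyperparameters, and the AdamW-style decoupled weight decay.

```python
import numpy as np

def aida_step(theta, grad, m, r, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, p=2.0, q=1.0):
    """One Aida-style update (hypothetical sketch; bias correction omitted)."""
    # First moment of the gradient, assumed to be an EMA as in AdamW.
    m = beta1 * m + (1.0 - beta1) * grad
    # Assumed EMA of the p-th power of the gradient magnitudes.
    # For p = 2 this is the usual second moment v_t of AdamW.
    r = beta2 * r + (1.0 - beta2) * np.abs(grad) ** p
    # Sign vector times the q-th power of the magnitude:
    # sign(m) * |m|^q / (r + eps)^(q/p), computed elementwise.
    direction = np.sign(m) * np.abs(m) ** q / (r + eps) ** (q / p)
    # Decoupled weight decay in the style of AdamW (assumption).
    theta = theta - lr * (direction + weight_decay * theta)
    return theta, m, r

# Usage: a few steps on f(x) = 0.5 * ||x||^2, whose gradient is x.
theta = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(theta)
r = np.zeros_like(theta)
for _ in range(5):
    theta, m, r = aida_step(theta, theta, m, r, lr=0.01, p=2.0, q=1.0)
```

With (p,q)=(2,1) the `direction` line reduces to m/(r+eps)^0.5, the AdamW form quoted in the abstract. Other (p,q) pairs change only the two exponents.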

Related papers