Fugu-MT Paper Translation (Abstract): Adam$^+$: A Stochastic Method with Adaptive Variance Reduction

Paper summary: Adam$^+$: A Stochastic Method with Adaptive Variance Reduction

arxiv url: http://arxiv.org/abs/2011.11985v1
Date: Tue, 24 Nov 2020 09:28:53 GMT
Status: Translation complete
Last updated in system: 2022-09-21 14:13:22.313802
Title: Adam$^+$: A Stochastic Method with Adaptive Variance Reduction
Title (reference translation): Adam$^+$: A Stochastic Method Using Adaptive Variance Reduction
Authors: Mingrui Liu, Wei Zhang, Francesco Orabona, Tianbao Yang
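Abstract summary: Adam is an optimization method widely used in deep learning applications. We propose a new method named Adam$^+$ (pronounced Adam-plus). Empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, show that Adam$^+$ significantly outperforms Adam.
Reference score (independently computed attention score): 56.051001950733315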
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adam is a widely used stochastic optimization method for deep learning applications. While practitioners prefer Adam because it requires less parameter tuning, its use is problematic from a theoretical point of view since it may not converge. Variants of Adam have been proposed with provable convergence guarantees, but they tend not to be competitive with Adam in practical performance. In this paper, we propose a new method named Adam$^+$ (pronounced as Adam-plus). Adam$^+$ retains some of the key components of Adam but it also has several noticeable differences: (i) it does not maintain the moving average of the second moment estimate but instead computes the moving average of the first moment estimate at extrapolated points; (ii) its adaptive step size is formed not by dividing by the square root of the second moment estimate but instead by dividing by the root of the norm of the first moment estimate. As a result, Adam$^+$ requires little parameter tuning, like Adam, but it enjoys a provable convergence guarantee. Our analysis further shows that Adam$^+$ enjoys adaptive variance reduction, i.e., the variance of the stochastic gradient estimator reduces as the algorithm converges, hence enjoying an adaptive convergence. We also propose a more general variant of Adam$^+$ with different adaptive step sizes and establish its fast convergence rate. Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam and achieves comparable performance with best-tuned SGD and momentum SGD.
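Abstract (reference translation): Adam is a stochastic optimization method widely used in deep learning applications. Practitioners prefer Adam because it requires little parameter tuning, but its use is problematic from a theoretical standpoint because it may not converge. Variants of Adam with provable convergence guarantees have been proposed, but they tend not to be competitive with Adam in practical performance. In this paper, we propose a new method named Adam$^+$ (pronounced Adam-plus). Adam$^+$ retains some of the key components of Adam but differs in several notable ways: (i) it does not maintain a moving average of the second-moment estimate and instead computes a moving average of the first-moment estimate at extrapolated points; (ii) its adaptive step size is formed not by dividing by the square root of the second-moment estimate but by dividing by the root of the norm of the first-moment estimate. As a result, Adam$^+$ requires little parameter tuning, like Adam, while having a provable convergence guarantee. The analysis further shows that Adam$^+$ enjoys adaptive variance reduction, i.e., the variance of the stochastic gradient estimator decreases as the algorithm converges, which yields adaptive convergence. We also propose a more general variant of Adam$^+$ with different adaptive step sizes and establish its fast convergence rate. Empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, show that Adam$^+$ significantly outperforms Adam and achieves performance comparable to best-tuned SGD and momentum SGD.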
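
The two differences from Adam named in the abstract, (i) a moving average of gradients taken at extrapolated points and (ii) a step size divided by the root of the norm of that average, can be sketched as a single update step. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the abstract gives no update formulas, so the extrapolation rule `w_hat = w + (w_new - w) / beta`, the function name `adam_plus_step`, and the parameters `alpha`, `beta`, and `eps` are all hypothetical choices made here.

```python
import math

def adam_plus_step(w, z, g, alpha=0.1, beta=0.1, eps=1e-8):
    """One illustrative Adam+-style update on plain Python lists.

    w: current iterate; z: moving average of past stochastic gradients;
    g: stochastic gradient evaluated at the current extrapolated point.
    The exact update rules are ASSUMPTIONS for illustration only.
    """
    # (i) First-moment moving average; no second-moment average is kept.
    z_new = [(1.0 - beta) * zi + beta * gi for zi, gi in zip(z, g)]

    # (ii) Step size: divide by the square root of the norm of z_new.
    z_norm = math.sqrt(sum(zi * zi for zi in z_new))
    eta = alpha / max(math.sqrt(z_norm), eps)

    # Descent step along the averaged gradient.
    w_new = [wi - eta * zi for wi, zi in zip(w, z_new)]

    # Assumed extrapolation: move past w_new along the step direction.
    # The next stochastic gradient would be evaluated at w_hat.
    w_hat = [wi + (wni - wi) / beta for wi, wni in zip(w, w_new)]
    return w_new, z_new, w_hat

# Single step with hand-checkable numbers.
w1, z1, w_hat1 = adam_plus_step(w=[2.0], z=[0.0], g=[4.0], alpha=0.5, beta=0.25)
print(w1, z1, w_hat1)  # [1.5] [1.0] [0.0]
```

Because the step size scales with the inverse root of the norm of `z_new`, the actual displacement `eta * z_new` has magnitude `alpha * sqrt(norm)`, so steps shrink as the averaged gradient shrinks. Whether this particular sketch converges on a given problem depends on the assumed constants and is not claimed here.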
