Fugu-MT Paper Translation (Abstract): AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Paper abstract: AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

arxiv url: http://arxiv.org/abs/2006.08217v3
Date: Mon, 18 Jan 2021 14:36:15 GMT
Status: Translation complete
Last updated in system: 2022-11-21 02:20:22.702448
Title: AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Title (reference translation): AdamP: Slowdown of Momentum Optimization on Scale-invariant Weights
Authors: Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha
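Abstract summary: Normalization techniques are a boon for modern deep learning. It is often overlooked, however, that introducing momentum makes the effective step sizes for scale-invariant weights shrink rapidly. This paper verifies that the combination of these two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
Reference score (independently computed attention score): 53.8489656709356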
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.
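Abstract (reference translation): Normalization techniques are a boon for modern deep learning. They make weights converge faster, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights gives gradient descent (GD) optimizers an advantageous footing: the effective step sizes decrease automatically over time, which stabilizes the overall training procedure. However, introducing momentum into GD optimizers makes the effective step sizes for scale-invariant weights shrink far more rapidly, a phenomenon that has not yet been studied and that may have caused unwanted side effects in current practice. This is an important issue because the great majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. This paper verifies that the widely adopted combination of these two ingredients leads to premature decay of effective step sizes and to sub-optimal model performance. As a simple and effective remedy, it proposes SGDP and AdamP, which remove the radial component (the norm-increasing direction) at each optimizer step. Because of the scale invariance, this modification changes only the effective step sizes and leaves the effective update directions unchanged, so it retains the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, the methods were evaluated against the baselines on 13 benchmarks. These range from vision tasks such as classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE). The solution is confirmed to bring uniform gains across those benchmarks. Source code is available at https://github.com/clovaai/AdamP.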
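
The remedy described in the abstract (removing the radial, norm-increasing component at each optimizer step) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which lives in the linked repository. The toy loss f(w) = -cos(w, target), the helper names (`project_out_radial`, `loss_grad`, `run`), and the hyperparameters are all assumptions chosen for illustration, and the sketch projects every update unconditionally, which is a simplification.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project_out_radial(w, update):
    """Return `update` minus its component along `w` (the radial part)."""
    scale = dot(w, update) / dot(w, w)
    return [u - scale * x for x, u in zip(w, update)]


def loss_grad(w, target):
    """Gradient of the toy scale-invariant loss f(w) = -cos(w, target)."""
    n = norm(w)
    t_norm = norm(target)
    w_hat = [x / n for x in w]
    t_hat = [t / t_norm for t in target]
    c = dot(w_hat, t_hat)
    # f depends only on the direction of w, so this gradient is orthogonal to w.
    return [-(t - c * x) / n for x, t in zip(w_hat, t_hat)]


def run(steps, lr, beta, project):
    """Momentum GD on the toy loss; optionally drop the radial part of each step."""
    w, target, p = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        g = loss_grad(w, target)
        p = [beta * pi + gi for pi, gi in zip(p, g)]  # momentum buffer
        step = project_out_radial(w, p) if project else p
        w = [x - lr * s for x, s in zip(w, step)]
    return w


if __name__ == "__main__":
    # Hypothetical settings, picked only to keep the toy run short.
    plain = run(steps=5, lr=0.1, beta=0.9, project=False)
    projected = run(steps=5, lr=0.1, beta=0.9, project=True)
    print("weight norm, plain momentum:     ", norm(plain))
    print("weight norm, radial part removed:", norm(projected))
```

In this toy run the weight norm grows more slowly when the radial part is removed. Because the gradient of a scale-invariant loss scales inversely with the weight norm, slower norm growth means the effective step size decays less quickly, which is the effect the abstract attributes to SGDP and AdamP.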

Related papers