Fugu-MT Paper Translation (Abstract): Robust Training of Neural Networks using Scale Invariant Architectures

Paper overview: Robust Training of Neural Networks using Scale Invariant Architectures

arxiv url: http://arxiv.org/abs/2202.00980v1
Date: Wed, 2 Feb 2022 11:58:56 GMT
Status: Translation complete
Last updated in system: 2022-02-03 13:47:06.732767
Title: Robust Training of Neural Networks using Scale Invariant Architectures
Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar
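Abstract summary: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks. The paper shows that its general approach is robust to rescaling of the parameters and the loss. The authors design a scale-invariant version of BERT, called SIBERT, which, when trained simply with vanilla SGD, achieves performance comparable to BERT trained with adaptive methods like Adam.
Reference score (attention metric computed by this site): 70.67803417918854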
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e. the scale of parameter doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is learning rate and $\lambda$ is weight decay. We show that this general approach is robust to rescaling of parameter and loss by proving that its convergence only depends logarithmically on the scale of initialization and loss, whereas the standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam on downstream tasks.
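The three-step recipe in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy model (a logistic classifier on the normalized weights w/||w||), the helper names `loss_and_grad`, `clip_grad` and `sgd_step`, the decoupled form of the weight-decay update, and the proportionality constant of 1 in the clipping threshold are all assumptions made for illustration.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Logistic loss of a linear model evaluated on u = w / ||w||.

    Because only the direction of w is used, the loss is scale invariant:
    loss(c * w) == loss(w) for any c > 0. Labels y are in {-1, +1}.
    """
    norm = np.linalg.norm(w)
    u = w / norm
    margins = y * (X @ u)
    loss = np.mean(np.log1p(np.exp(-margins)))
    # Derivative of log(1 + exp(-y z)) with respect to z.
    dz = -y / (1.0 + np.exp(margins))
    grad_u = X.T @ dz / len(y)
    # Chain rule through the normalization: (I - u u^T) / ||w||.
    grad_w = (grad_u - u * (u @ grad_u)) / norm
    return loss, grad_w

def clip_grad(w, grad, lr, wd):
    """Clip the gradient norm to sqrt(2 * wd / lr) * ||w||.

    The abstract only says "proportional to"; a constant of 1 is assumed here.
    """
    max_norm = np.sqrt(2.0 * wd / lr) * np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)
    return grad

def sgd_step(w, grad, lr, wd, clip=True):
    """One step of plain SGD with weight decay (assumed update form)."""
    if clip:
        grad = clip_grad(w, grad, lr, wd)
    return (1.0 - lr * wd) * w - lr * grad

# Tiny demo on synthetic data, started from a badly scaled initialization.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = np.sign(X @ rng.normal(size=8))
w = 1e-3 * rng.normal(size=8)
for _ in range(200):
    _, g = loss_and_grad(w, X, y)
    w = sgd_step(w, g, lr=0.1, wd=0.01)
final_loss, _ = loss_and_grad(w, X, y)
```

In this toy model the gradient at `c * w` equals the gradient at `w` divided by `c`, so a tiny initialization produces very large gradients; the clipping threshold scales with the weight norm, which is what keeps the first steps bounded in the sketch.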

Related papers

No More Adam: Learning Rate Scaling at Initialization is All You Need [13.892699813809857]
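SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI can prevent training imbalances from the very first step. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW when training a variety of Transformer-based tasks.
Paper reference translation (metadata) (2024-12-16T13:41:37Z)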
PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization [35.922096876707975]
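PACE marries generalization in PArameter-efficient fine-tuning with Consistency rEgularization. It implicitly regularizes gradients for enhanced generalization, and it also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. It also improves LoRA on text classification (GLUE) and mathematical reasoning.
Paper reference translation (metadata) (2024-09-25T17:56:00Z)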
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
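The authors propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance. The work provides theoretical and experimental evidence that, in the context of tuning transformers, the proposed estimators exhibit lower variance than existing ones.
Paper reference translation (metadata) (2023-05-24T15:52:08Z)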
Dissecting adaptive methods in GANs [46.90376306847234]
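The authors study how adaptive methods help train generative adversarial networks (GANs). By considering an update rule that combines the magnitude of the Adam update with the normalized direction of SGD, they show empirically that the adaptive magnitude of Adam is key to GAN training. In this setting, they prove that GANs trained with nSGDA recover all modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
Paper reference translation (metadata) (2022-10-09T19:00:07Z)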
Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks [14.685237010856953]
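The authors develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks. They show that learning can be performed using one of two alternatives to backpropagation.
Paper reference translation (metadata) (2021-09-30T12:56:57Z)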
GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
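The authors propose GradInit, an automated, architecture-agnostic method for initializing neural networks. The variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value. The method also makes it possible to train the original Post-LN Transformer for machine translation without learning rate warmup.
Paper reference translation (metadata) (2021-02-16T11:45:35Z)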
Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
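The authors propose a provable method (ABSGD) for addressing data imbalance and label noise in deep learning. The method is a simple modification of momentum SGD that assigns an individual weight to each sample. ABSGD is flexible enough to be combined with other robust losses at no additional cost.
Paper reference translation (metadata) (2020-12-13T03:41:52Z)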
MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
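This paper proposes an adaptive learning rate principle that replaces the running mean of squared gradients in Adam with a weighted mean. This enables faster adaptation and leads to more desirable empirical convergence behavior.
Paper reference translation (metadata) (2020-06-21T21:47:43Z)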
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
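Normalization techniques are a boon for modern deep learning. It is often overlooked, however, that introducing momentum causes the effective step size for scale-invariant weights to shrink rapidly. This paper verifies that the combination of these two ingredients leads to premature decay of the effective step size and to sub-optimal model performance.
Paper reference translation (metadata) (2020-06-15T08:35:15Z)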
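The effective step size mentioned in the AdamP entry above can be made concrete with a small sketch. For a scale-invariant loss the gradient is orthogonal to the weights, so a plain gradient step without weight decay can only increase the weight norm, and the effective step size lr / ||w||^2 shrinks accordingly. The toy loss (a quadratic form on w/||w||) and the helper names `normalized_quadratic_grad` and `run` are assumptions for illustration and are not taken from any of the listed papers.

```python
import numpy as np

def normalized_quadratic_grad(w, A):
    """Gradient of L(w) = u^T A u with u = w / ||w|| and A symmetric.

    L depends only on the direction of w, so it is scale invariant and
    its gradient is orthogonal to w.
    """
    norm = np.linalg.norm(w)
    u = w / norm
    grad_u = 2.0 * (A @ u)
    return (grad_u - u * (u @ grad_u)) / norm

def run(w0, A, lr, momentum, steps):
    """Gradient descent with heavy-ball momentum; returns ||w|| per step."""
    w = w0.copy()
    v = np.zeros_like(w)
    norms = [np.linalg.norm(w)]
    for _ in range(steps):
        g = normalized_quadratic_grad(w, A)
        v = momentum * v + g
        w = w - lr * v
        norms.append(np.linalg.norm(w))
    return np.array(norms)

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = B + B.T                      # symmetric matrix for the toy loss
w0 = rng.normal(size=6)

norms_plain = run(w0, A, lr=0.1, momentum=0.0, steps=50)
norms_momentum = run(w0, A, lr=0.1, momentum=0.9, steps=50)

# Effective step size on the unit sphere for each trajectory.
eff_lr_plain = 0.1 / norms_plain ** 2
eff_lr_momentum = 0.1 / norms_momentum ** 2
```

Comparing `eff_lr_plain` with `eff_lr_momentum` gives a rough feel for how momentum changes the norm growth on this toy problem; no particular numerical outcome is claimed here.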

The related papers list is generated automatically from the titles and abstracts of papers on this site.
