Fugu-MT Paper Translation (Abstract): The Optimization Landscape of SGD Across the Feature Learning Strength

Paper summary: The Optimization Landscape of SGD Across the Feature Learning Strength

arxiv url: http://arxiv.org/abs/2410.04642v2
Date: Tue, 8 Oct 2024 12:28:22 GMT
Status: Translation complete
Last updated in system: 2024-11-02 02:47:36.387287
Title: The Optimization Landscape of SGD Across the Feature Learning Strength
Title (reference translation): The Optimization Landscape of SGD Across the Feature Learning Strength
Authors: Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan
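Abstract summary: We study the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. Optimal online performance is often found at large $\gamma$. These findings suggest that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
Reference score (independently computed attention score): 102.1353410293931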
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
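Abstract (reference translation): We consider neural networks (NNs) in which the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has shown that $\gamma$ controls the strength of feature learning. As $\gamma$ increases, the network's evolution shifts from "lazy" kernel dynamics to "rich" feature-learning dynamics, bringing a number of benefits, including improved performance on common tasks. In this work, we carry out a thorough empirical study of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine how $\gamma$ interacts with the learning rate $\eta$ and identify several scaling regimes in the $\gamma$-$\eta$ plane, which we explain theoretically using a simple model. The optimal learning rate $\eta^*$ scales non-trivially with $\gamma$: for a feed-forward network of depth $L$, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$. Using this optimal learning rate scaling, we empirically study the under-explored "ultra-rich" regime $\gamma \gg 1$. Networks in this regime show characteristic loss curves that begin with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. Networks with different large values of $\gamma$ optimize along similar trajectories up to a reparameterization of time. Furthermore, optimal online performance is often found at large $\gamma$ and may be missed if this hyperparameter is not tuned. These results suggest that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.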
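
The learning-rate scaling stated in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `scaled_learning_rate`, the reference constant `eta_ref`, and the hard switch between regimes at $\gamma = 1$ are all assumptions made for illustration. The abstract only gives the asymptotic behavior for $\gamma \ll 1$ and $\gamma \gg 1$ and says nothing about the crossover or the proportionality constant.

```python
import math


def scaled_learning_rate(gamma: float, depth: int, eta_ref: float = 1.0) -> float:
    """Heuristic learning rate following the scaling quoted in the abstract.

    eta* ~ gamma^2       for gamma << 1 (lazy side)
    eta* ~ gamma^(2/L)   for gamma >> 1 (ultra-rich side), L = depth

    Assumptions (not from the paper): the two power laws are glued together
    at gamma = 1, and eta_ref is the learning rate one would use at gamma = 1.
    """
    if gamma <= 0 or depth < 1:
        raise ValueError("gamma must be positive and depth must be >= 1")
    exponent = 2.0 if gamma < 1.0 else 2.0 / depth
    return eta_ref * gamma ** exponent


# Example: a depth-4 feed-forward network.
lazy_lr = scaled_learning_rate(0.1, depth=4)    # uses the gamma^2 branch
rich_lr = scaled_learning_rate(16.0, depth=4)   # uses the gamma^(2/4) branch
assert math.isclose(lazy_lr, 0.01)
assert math.isclose(rich_lr, 4.0)
```

Because both branches equal `eta_ref` at $\gamma = 1$, the sketch is continuous there. In practice `eta_ref` would have to be tuned once for a given model and dataset before the scaling is applied.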
