Fugu-MT Paper Translation (Abstract): Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

Paper summary: Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

arxiv url: http://arxiv.org/abs/2605.17767v2
Date: Thu, 21 May 2026 20:45:44 GMT
Status: Translation complete
Last updated in system: 2026-05-25 14:44:53.679172
Title: Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
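Title (reference translation): Feature Learning in Linear-Width Two-Layer Networks: Two Steps vs. One Step
Authors: Behrad Moniri, Hamed Hassani
Abstract summary: We study feature learning in two-layer neural networks in the linear-width regime. By characterizing the early-phase evolution, we propose a tractable framework for studying optimization and feature-learning phenomenology.
Reference score (independently computed attention measure): 32.9638210129515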
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent on the first layer weights in this regime, such one-step update schemes are fundamentally limited: the update to the weights is approximately rank-one, captures only a single direction, and requires the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0, 0.5)$, where $N$ is the number of hidden neurons. We derive a spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of outliers is determined by the parameters $\alpha_1, \alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, provided that $\alpha_1, \alpha_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, our work proposes a tractable framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
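Abstract (reference translation): We study feature learning in two-layer neural networks in the linear-width regime, where the number of hidden neurons, the sample size, and the input dimension scale proportionally. Recent work has analyzed feature learning via a single step of gradient descent on the first-layer weights in this regime, but such one-step update schemes are fundamentally limited. In this paper, we give a full characterization of the features learned during the second step of gradient descent with step sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0, 0.5)$. We derive a spectral characterization of the updated weights and demonstrate that they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of outliers is determined by the parameters $\alpha_1, \alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent batches and training with reused batches. Independent batches restrict learning to directions with an information exponent of one, whereas batch reuse allows the second update to capture directions even when the information exponent exceeds one, provided that $\alpha_1, \alpha_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, we propose a tractable framework for studying optimization and feature-learning phenomenology in modern overparameterized networks.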
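
As a small worked illustration of the outlier-count expression quoted in the abstract, the sketch below evaluates $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$ for a few step-size exponents. The function name `outlier_count` and the sample exponent values are assumptions made for illustration; they do not come from the paper, and the sketch only evaluates the stated formula rather than reproducing any of the paper's analysis. Exact rational arithmetic is used so that the floor is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import floor


def outlier_count(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)), the expression quoted in the abstract.

    alpha1, alpha2 are the step-size exponents (eta_k scales like N**alpha_k).
    Pass them as strings or Fractions (e.g. "0.3") to keep the arithmetic exact.
    The name `outlier_count` is hypothetical, chosen for this illustration.
    """
    a1, a2 = Fraction(alpha1), Fraction(alpha2)
    half = Fraction(1, 2)
    # The abstract restricts both exponents to the interval [0, 0.5).
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("alpha1 and alpha2 must lie in [0, 0.5)")
    return floor(a2 / (half - a1))


if __name__ == "__main__":
    # Illustrative exponent pairs (assumed values, not taken from the paper).
    for a1, a2 in [("0", "0.4"), ("0.25", "0.25"), ("0.3", "0.4"), ("0.4", "0.45")]:
        print(f"alpha1={a1}, alpha2={a2} -> {outlier_count(a1, a2)}")
```

For example, with $\alpha_1 = 0.3$ and $\alpha_2 = 0.4$ the ratio is $0.4 / 0.2 = 2$, so the expression evaluates to 2, while $\alpha_1 = 0$ and $\alpha_2 = 0.4$ gives $\lfloor 0.8 \rfloor = 0$. Larger first-step exponents shrink the denominator and therefore raise the value of the expression.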

Related papers