Fugu-MT Paper Translation (Abstract): $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

Paper abstract: $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

arxiv url: http://arxiv.org/abs/2509.21519v1
Date: Thu, 25 Sep 2025 20:08:09 GMT
Status: translation complete
System update date: 2025-09-29 20:57:53.974345
Title: $\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization
Title (reference translation): $\mathbf{Li_2}$: A Framework on the Dynamics of Feature Emergence and Delayed Generalization
Authors: Yuandong Tian
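Abstract summary: We study the phenomenon of grokking, i.e., delayed generalization, in nonlinear networks. We capture three key stages of the grokking behavior of 2-layer nonlinear networks. Our study sheds light on the roles that hyperparameters such as weight decay, learning rate, and sample size play in grokking.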
Reference score (independently computed attention score): 44.614763110719274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, and how and under which conditions they arise during training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of the backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and the top layer overfits to a random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns its representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follow exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample size in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying reason why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
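Abstract (reference translation): Although the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether a mathematical framework exists that characterizes what kind of features emerge, and how and under which conditions they arise, for complex structured inputs. We propose a new framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. In (I), $G_F$ is random, and the top layer overfits to a random hidden representation. In (II), the gradient of each node (column of $G_F$) depends only on its own activation, so each hidden node learns its representation independently from $G_F$, which now carries information about the target labels. Interestingly, the independent dynamics follow exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact and how $G_F$ changes to focus on the missing features that still need to be learned. By shedding light on the roles of key hyperparameters such as weight decay, learning rate, and sample size, our study leads to provable scaling laws of memorization and generalization, and reveals, from the first principles of gradient dynamics, why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.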
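
The setup described in the abstract can be made concrete with a small sketch. It builds a 2-layer network on a group arithmetic task and computes the gradient $G_F$ backpropagated to the hidden activations, whose columns correspond to hidden nodes. The abstract fixes none of the details, so everything specific here is an assumption for illustration: modular addition as the task, one-hot inputs, ReLU activation, squared loss, plain gradient descent with weight decay, and all function names (`make_modular_addition`, `backprop_gradient`, `train_step`, `loss`).

```python
import numpy as np

def make_modular_addition(p):
    """All pairs (a, b) with label (a + b) mod p, one-hot encoded (assumed task)."""
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    a, b = a.ravel(), b.ravel()
    rows = np.arange(p * p)
    X = np.zeros((p * p, 2 * p))
    X[rows, a] = 1.0          # first operand occupies columns 0..p-1
    X[rows, p + b] = 1.0      # second operand occupies columns p..2p-1
    Y = np.zeros((p * p, p))
    Y[rows, (a + b) % p] = 1.0
    return X, Y

def backprop_gradient(X, Y, W, V):
    """Return hidden activations F and the gradient G_F backpropagated to F.

    Assumes F = relu(X W), prediction F V, and loss 0.5 * ||Y - F V||^2, so the
    descent direction with respect to F is G_F = (Y - F V) V^T. Column j of
    G_F is the signal received by hidden node j.
    """
    F = np.maximum(X @ W, 0.0)
    G_F = (Y - F @ V) @ V.T
    return F, G_F

def loss(X, Y, W, V):
    """Mean squared-error loss per sample."""
    F = np.maximum(X @ W, 0.0)
    return 0.5 * np.sum((Y - F @ V) ** 2) / X.shape[0]

def train_step(X, Y, W, V, lr=0.05, weight_decay=1e-4):
    """One full-batch gradient descent step with weight decay on both layers."""
    n = X.shape[0]
    F, G_F = backprop_gradient(X, Y, W, V)
    step_V = F.T @ (Y - F @ V) / n        # descent direction for the top layer
    step_W = X.T @ (G_F * (F > 0)) / n    # G_F gated by the ReLU derivative
    W_new = W + lr * (step_W - weight_decay * W)
    V_new = V + lr * (step_V - weight_decay * V)
    return W_new, V_new
```

With `X, Y = make_modular_addition(5)` and small random `W` (shape 10 x K) and `V` (shape K x 5), repeated calls to `train_step` update both layers, and `backprop_gradient` can be inspected at any step to see how $G_F$ evolves. This sketch does not reproduce the paper's analysis of the three stages or its energy function $E$.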

Related papers