Fugu-MT Paper Translation (Abstract): Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

Paper overview: Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

arxiv url: http://arxiv.org/abs/2509.21519v3
Date: Tue, 30 Sep 2025 17:43:09 GMT
Status: translation complete
Last updated in system: 2025-10-01 14:44:59.840084
Title: Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
Title (reference translation): Provable Scaling Laws of Feature Emergence from the Learning Dynamics of Grokking
Authors: Yuandong Tian
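Abstract summary: We study the phenomenon of grokking, i.e., delayed generalization. We propose a novel framework that captures three key stages of the grokking behavior of 2-layer nonlinear networks. Our study sheds light on the roles played by hyperparameters such as weight decay, learning rate, and sample size in grokking.
Reference score (independently computed attention score): 44.614763110719274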
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, how and under which conditions this happens, and is closely related to the gradient dynamics of the training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. At the lazy learning stage, the top layer overfits to a random hidden representation and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn its representation \emph{independently}. Interestingly, the independent dynamics follow exactly the \emph{gradient ascent} of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample size in grokking, leads to provable scaling laws of feature emergence, memorization and generalization, and reveals the underlying reason why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
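Abstract (reference translation): Although the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework that characterizes, for complex structured inputs, what kind of features emerge and how and under which conditions they do so. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning, and (III) \underline{\textbf{i}}nteractive feature learning. In the lazy learning stage, the top layer overfits to a random hidden representation and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follow exactly the \emph{gradient ascent} of an energy function $E$. We study whether these local-optima induced features are generalizable, their representation power, and how they change with sample size in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on the missing features that need to be learned. This study sheds light on the roles of key hyperparameters such as weight decay, learning rate, and sample size, establishes provable scaling laws of feature emergence, memorization, and generalization, and reveals, from the first principles of gradient dynamics, why recent optimizers such as Muon can be effective. Our analysis can be extended to multi-layer architectures.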
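
Illustrative sketch (not from the paper): the abstract states that, after the top layer overfits to a random hidden representation under weight decay, the backpropagated gradient $G_F$ carries information about the target label. The NumPy snippet below is a minimal sketch of that setup under several assumptions that the abstract does not specify: a modular-addition task as the group arithmetic example, one-hot inputs, a ReLU hidden layer, a squared loss, and a top layer fitted in closed form by ridge regression as a stand-in for the lazy learning stage. The function names `modular_addition_data` and `lazy_top_layer_and_G_F`, and the sign and scale convention for `G_F`, are hypothetical and may differ from the paper's definitions.

```python
import numpy as np

def modular_addition_data(M):
    """All pairs (a, b) in Z_M with label (a + b) mod M, one-hot encoded.

    Assumption: modular addition stands in for the paper's group arithmetic tasks.
    """
    a, b = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    a, b = a.ravel(), b.ravel()
    n = M * M
    X = np.zeros((n, 2 * M))
    X[np.arange(n), a] = 1.0        # one-hot block for the first operand
    X[np.arange(n), M + b] = 1.0    # one-hot block for the second operand
    Y = np.zeros((n, M))
    Y[np.arange(n), (a + b) % M] = 1.0
    return X, Y

def lazy_top_layer_and_G_F(X, Y, W, weight_decay):
    """Fit the top layer on fixed random features, then backpropagate to them.

    Assumptions: ReLU activation, loss 0.5*||F V - Y||^2 + 0.5*weight_decay*||V||^2,
    and G_F defined as the gradient of the data term with respect to F.
    """
    F = np.maximum(X @ W, 0.0)                     # hidden representation (W stays fixed)
    K = F.shape[1]
    # Closed-form ridge solution for the top layer V (requires weight_decay > 0).
    V = np.linalg.solve(F.T @ F + weight_decay * np.eye(K), F.T @ Y)
    G_F = (F @ V - Y) @ V.T                        # gradient of 0.5*||F V - Y||^2 w.r.t. F
    return F, V, G_F

# Usage: random hidden weights, small weight decay.
rng = np.random.default_rng(0)
X, Y = modular_addition_data(5)
W = rng.standard_normal((X.shape[1], 64))
F, V, G_F = lazy_top_layer_and_G_F(X, Y, W, weight_decay=1e-2)
print("norm of G_F:", np.linalg.norm(G_F))
```

In this sketch the ridge stationarity condition gives $F^\top (Y - FV) = \eta V$ for weight decay $\eta > 0$, so the residual, and with it `G_F`, is nonzero and depends on the labels `Y`. If the hidden layer is wide enough for the top layer to interpolate the training labels, the residual shrinks toward zero as $\eta \to 0$, which is one way to read the abstract's remark that weight decay is what lets $G_F$ carry label information.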

Related papers