Fugu-MT paper translation (abstract): Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions

Paper summary: Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions

arxiv url: http://arxiv.org/abs/2603.13331v1
Date: Thu, 05 Mar 2026 17:28:39 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:42.312717
Title: Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions
Title (reference translation): Why Does Grokking Take So Long: A First-Principles Theory of Representational Phase Transitions
Authors: Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc,
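Abstract summary: Grokking is the sudden generalization that appears long after a model has memorized its training data. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition in regularized training dynamics.
Reference score (independently computed attention score): 0.0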
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grokking is the sudden generalization that appears long after a model has perfectly memorized its training data. Although this phenomenon has been widely observed, there is still no quantitative theory explaining the length of the delay between memorization and generalization. Prior work has noted that weight decay plays an important role, but no result derives tight bounds for the delay or explains its scaling behavior. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition in regularized training dynamics. Training first converges to a high-norm memorization solution and only later contracts toward a lower-norm structured representation that generalizes. Our main result establishes a scaling law for the delay: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), where gamma_eff is the effective contraction rate of the optimizer (gamma_eff = eta * lambda for SGD and gamma_eff >= eta * lambda for AdamW). The upper bound follows from a discrete Lyapunov contraction argument, and the matching lower bound arises from dynamical constraints of regularized first-order optimization. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity tasks, we confirm three predictions: inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97). We further find that grokking requires an optimizer that can decouple memorization from contraction: SGD fails under hyperparameters where AdamW reliably groks. These results show that grokking is a predictable consequence of norm separation between competing interpolating representations and provide the first quantitative scaling law for the delay of grokking.
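Abstract (reference translation): Grokking is the sudden generalization that appears after a model has perfectly memorized its training data. The phenomenon has been widely observed, but there is still no quantitative theory that explains the length of the delay between memorization and generalization. Earlier work noted that weight decay plays an important role, yet no result gives tight bounds on the delay or explains how it scales. We propose a first-principles theory showing that grokking arises from a norm-driven representational phase transition in regularized training dynamics. Training first converges to a high-norm memorization solution and only later contracts toward a lower-norm structured representation that generalizes. The main result is a scaling law for the delay: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), where gamma_eff is the effective contraction rate of the optimizer (gamma_eff = eta * lambda for SGD and gamma_eff >= eta * lambda for AdamW). The upper bound follows from a discrete Lyapunov contraction argument, and the matching lower bound arises from dynamical constraints of regularized first-order optimization. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity tasks, three predictions are confirmed: inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97). The authors further find that grokking requires an optimizer that can decouple memorization from contraction: SGD fails under hyperparameters where AdamW reliably groks. These results show that grokking is a predictable consequence of norm separation between competing interpolating representations, and they provide the first quantitative scaling law for the delay of grokking.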
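The scaling law in the abstract can be made concrete with a small numerical sketch. The code below is not the authors' implementation. It makes two assumptions: the hidden Theta constant is set to 1, and the contraction phase is modeled as pure weight-decay shrinkage theta <- (1 - eta * lambda) * theta with the loss gradient ignored. The function names `delay_scale` and `toy_contraction_steps` are hypothetical, chosen only for this illustration.

```python
import math


def delay_scale(eta, lam, norm_mem_sq, norm_post_sq):
    """Scale of T_grok - T_mem under the abstract's law.

    Assumption: the constant hidden by Theta is set to 1, and
    gamma_eff = eta * lam (the SGD case; the abstract states
    gamma_eff >= eta * lam for AdamW).
    """
    gamma_eff = eta * lam
    return math.log(norm_mem_sq / norm_post_sq) / gamma_eff


def toy_contraction_steps(eta, lam, norm_mem_sq, norm_post_sq):
    """Count steps of pure weight-decay shrinkage until the squared
    norm falls from norm_mem_sq to norm_post_sq.

    Toy model (assumption): theta <- (1 - eta * lam) * theta, so the
    squared norm shrinks by (1 - eta * lam) ** 2 per step. The loss
    gradient is ignored entirely.
    """
    shrink = (1.0 - eta * lam) ** 2
    norm_sq = norm_mem_sq
    steps = 0
    while norm_sq > norm_post_sq:
        norm_sq *= shrink
        steps += 1
    return steps


if __name__ == "__main__":
    eta, norm_mem_sq, norm_post_sq = 1e-3, 100.0, 1.0
    for lam in (1.0, 2.0):
        scale = delay_scale(eta, lam, norm_mem_sq, norm_post_sq)
        steps = toy_contraction_steps(eta, lam, norm_mem_sq, norm_post_sq)
        print(f"lambda={lam}: law scale={scale:.1f}, toy steps={steps}")
```

In this toy model the step count is roughly half of `delay_scale`, since the squared norm contracts at rate about 2 * eta * lambda per step. That constant factor is absorbed by the Theta notation. Doubling the weight decay roughly halves the step count, matching the inverse scaling the abstract reports.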

Related papers