Fugu-MT Paper Translation (Abstract): Dimensional Criticality at Grokking Across MLPs and Transformers

Paper abstract: Dimensional Criticality at Grokking Across MLPs and Transformers

arxiv url: http://arxiv.org/abs/2604.16431v1
Date: Mon, 06 Apr 2026 13:43:20 GMT
Status: Translation complete
Last updated in system: 2026-05-04 02:32:13.992994
Title: Dimensional Criticality at Grokking Across MLPs and Transformers
Title (reference translation): Dimensional Criticality at Grokking in MLPs and Transformers
Authors: Ping Wang
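Abstract summary: Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. We introduce TDU--OFC (Thresholded Diffusion Update--Olami-Feder-Christensen), an offline avalanche probe. Transformers trained on modular addition and MLPs trained on XOR show a localized crossing of the diffusion baseline $D=1$ at the generalization transition.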
Reference score (independently computed attention score): 2.652953665748039
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example -- an abrupt transition from memorization to generalization long after training accuracy saturates -- yet robust macroscopic signatures of this transition remain elusive. Here we introduce \textbf{TDU--OFC} (Thresholded Diffusion Update--Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a \emph{macroscopic observable} -- the time-resolved effective cascade dimension $D(t)$ -- via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline $D=1$ precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through $D=1$ (approaching from $D>1$), while XOR ascends (from $D<1$). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near $D \approx 1$. Negative controls confirm this picture: ungrokked runs remain supercritical ($D>1$) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from $D(t)$. Shadow-probe controls ($α_{\mathrm{train}}=0$) confirm that $D(t)$ is non-invasive, and grokked trajectories diverge from ungrokked ones in $D(t)$ some $100$--$200$ epochs before the behavioral transition.
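Abstract (reference translation): Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example, an abrupt transition from memorization to generalization after training accuracy has saturated, yet robust macroscopic signatures of this transition remain elusive. Here we introduce \textbf{TDU--OFC} (Thresholded Diffusion Update--Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a \emph{macroscopic observable}, the time-resolved effective cascade dimension. Across Transformers trained on modular addition and MLPs trained on XOR, a localized dynamical crossing of the Gaussian diffusion baseline $D=1$ is found precisely at the generalization transition. Modular addition descends through $D=1$ (approaching from $D>1$), while XOR ascends from $D<1$. This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near $D \approx 1$. Negative controls support this picture: ungrokked runs remain supercritical ($D>1$) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from $D(t)$. Shadow-probe controls ($α_{\mathrm{train}}=0$) confirm that $D(t)$ is non-invasive, and grokked trajectories diverge from ungrokked ones in $D(t)$ some $100$--$200$ epochs before the behavioral transition.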
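Note on the heavy-tail claim: the abstract states that avalanche distributions exhibit heavy tails but does not define the TDU--OFC probe or $D(t)$ in enough detail to reproduce them. The sketch below is therefore not the paper's method. It only illustrates one standard way to quantify a heavy tail, the continuous maximum-likelihood estimator for a power-law exponent, applied to synthetic avalanche sizes. The names `tail_exponent`, `sample_power_law`, the cutoff `s_min`, and the exponent 2.5 are all assumptions made for this example.

```python
import math
import random


def tail_exponent(sizes, s_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(s / s_min)) over s >= s_min.

    `sizes` stands in for avalanche sizes; `s_min` is an assumed lower cutoff.
    """
    tail = [s for s in sizes if s >= s_min]
    if not tail:
        raise ValueError("no samples at or above s_min")
    log_sum = sum(math.log(s / s_min) for s in tail)
    if log_sum == 0.0:
        raise ValueError("all tail samples equal s_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, alpha, s_min, rng):
    """Inverse-transform sampling from p(s) proportional to s**(-alpha), s >= s_min."""
    return [s_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic data only: these are not avalanche sizes from the paper.
rng = random.Random(0)
synthetic_sizes = sample_power_law(20000, alpha=2.5, s_min=1.0, rng=rng)
estimate = tail_exponent(synthetic_sizes, s_min=1.0)
print(f"estimated tail exponent: {estimate:.2f}")
```

With real cascade data, the choice of `s_min` matters a great deal, and how such an exponent would relate to the dimensional exponent extracted from $D(t)$ is something only the full paper can specify.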

Related papers