Fugu-MT Paper Translation (Abstract): The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams

Paper abstract: The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams

arxiv url: http://arxiv.org/abs/2509.10167v1
Date: Fri, 12 Sep 2025 11:51:44 GMT
Status: Translation complete
System update date: 2025-09-15 16:03:08.072791
Title: The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
Title (reference translation): The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
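Authors: Lénaïc Chizat
Abstract summary: We study the gradient-based training of large-depth residual networks (ResNets). We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics.
Reference score (independently computed attention score): 15.246178589173523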
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of $M$, covering practical cases of, say, Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When $\alpha=\Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.
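Abstract (reference translation): We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Notably, this limit is independent of the scaling of $M$, which covers practical cases such as Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps. When $\alpha=\Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a lazy ODE regime in which the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks and study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps. Our convergence results rely on a new mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward passes through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.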
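
Illustrative sketch (not from the paper): the abstract states a residual scale of order $\frac{\alpha}{LM}$ and a fluctuation term of order $\frac{\alpha}{\sqrt{LM}}$ in the error bound. The minimal NumPy example below builds a ResNet with two-layer perceptron blocks at that residual scale and measures how far the output moves from the input at initialization. The tanh activation, the i.i.d. standard Gaussian initialization, the exact constant in the scale, and the helper names `init_params`, `resnet_forward` and `init_deviation` are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def init_params(L, M, D, rng):
    """Draw i.i.d. standard Gaussian weights (an assumed initialization)."""
    U = rng.standard_normal((L, M, D))  # per-layer input weights, each (M, D)
    V = rng.standard_normal((L, D, M))  # per-layer output weights, each (D, M)
    return U, V


def resnet_forward(x, U, V, alpha):
    """Apply x <- x + alpha / (L * M) * V_l tanh(U_l x) for each layer l."""
    L, M, _ = U.shape
    scale = alpha / (L * M)  # residual scale alpha / (L M), constant assumed to be 1
    for l in range(L):
        x = x + scale * (V[l] @ np.tanh(U[l] @ x))
    return x


def init_deviation(L, M, D, alpha, seed):
    """Norm of (output - input) at initialization for a unit-norm input."""
    rng = np.random.default_rng(seed)
    U, V = init_params(L, M, D, rng)
    x = np.ones(D) / np.sqrt(D)
    return float(np.linalg.norm(resnet_forward(x, U, V, alpha) - x))


if __name__ == "__main__":
    D, alpha = 8, 1.0
    for L, M in [(4, 4), (16, 8), (64, 16)]:
        # Average over a few seeds to smooth out the randomness of the draw.
        dev = np.mean([init_deviation(L, M, D, alpha, s) for s in range(5)])
        ref = alpha / np.sqrt(L * M)
        print(f"L={L:3d} M={M:3d} alpha/sqrt(LM)={ref:.4f} deviation={dev:.4f}")
```

Under these assumptions the output weights have zero mean, so the total residual update is a sum of $LM$ zero-mean terms scaled by $\frac{\alpha}{LM}$, and one would expect the printed deviation to shrink roughly in proportion to $\frac{\alpha}{\sqrt{LM}}$. This only illustrates the fluctuation term at initialization; it does not reproduce the paper's training experiments.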

Related papers