The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
- URL: http://arxiv.org/abs/2509.10167v1
- Date: Fri, 12 Sep 2025 11:51:44 GMT
- Title: The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
- Authors: Lénaïc Chizat
- Abstract summary: We study the gradient-based training of large-depth residual networks (ResNets). We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics.
- Score: 15.246178589173523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of $M$, covering practical cases of, say, Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When $\alpha=\Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.
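The residual scaling from the abstract can be made concrete with a minimal NumPy sketch of the forward pass for the two-layer perceptron-block case. This is an illustrative sketch, assuming ReLU blocks and i.i.d. standard Gaussian initialization; the names `resnet_forward`, `W_in`, and `W_out` are not from the paper.

```python
import numpy as np

def resnet_forward(x, W_in, W_out, alpha=1.0):
    """Forward pass of a depth-L ResNet with two-layer perceptron blocks:

        h_{l+1} = h_l + (alpha / (L * M)) * W_out[l] @ relu(W_in[l] @ h_l)

    i.e. the residual branch is scaled by Theta(alpha / (L M)), the regime
    studied in the abstract above.

    x      : (D,)      input embedding
    W_in   : (L, M, D) first-layer weights of each block
    W_out  : (L, D, M) second-layer weights of each block
    """
    L, M, D = W_in.shape
    scale = alpha / (L * M)  # residual scale Theta(alpha / (L M))
    h = x.copy()
    for l in range(L):
        h = h + scale * W_out[l] @ np.maximum(W_in[l] @ h, 0.0)
    return h

rng = np.random.default_rng(0)
D, M, L = 8, 8, 256  # small sizes for illustration; the paper takes L -> infinity
x = rng.standard_normal(D)
W_in = rng.standard_normal((L, M, D))
W_out = rng.standard_normal((L, D, M))
out = resnet_forward(x, W_in, W_out, alpha=1.0)
print(out.shape)  # → (8,)
```

With this 1/(LM) scaling, each block contributes an O(1/L) update on average, so the composition of L blocks behaves like a discretization of an ODE in depth, which is the intuition behind the Neural Mean ODE limit.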
Related papers
- A Gap Between Decision Trees and Neural Networks [2.4140387101794283]
We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{RTV}$ whose fixed threshold $=1$ exactly recovers the box. Experiments on synthetic unions of rectangles illustrate the resulting accuracy--complexity tradeoff.
arXiv Detail & Related papers (2026-01-07T13:40:30Z) - Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets [9.94514344279733]
Arithmetic-Mean $\mu$P constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L) \propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N \gg k$.
arXiv Detail & Related papers (2025-10-05T19:22:50Z) - Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization [3.902441198412341]
We study implicit regularization in diagonal linear neural networks of depth $D \ge 2$ for overparameterized linear regression problems. Our results reveal a qualitative difference between depths: for $D \ge 3$, the error decreases linearly with $\alpha$, whereas for $D=2$, it decreases at rate $\alpha^{1-\varrho}$. Numerical experiments corroborate our theoretical findings and suggest that deeper networks, i.e., $D \ge 3$, may lead to better generalization.
arXiv Detail & Related papers (2025-06-01T19:55:31Z) - Emergence and scaling laws in SGD learning of shallow neural networks [64.48316762675141]
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective.
arXiv Detail & Related papers (2025-04-28T16:58:55Z) - Uncertainty Quantification From Scaling Laws in Deep Neural Networks [0.0]
Quantifying uncertainty from machine learning analyses is critical to their use in the physical sciences. We compute the mean $\mu_{\mathcal{L}}$ and variance $\sigma_{\mathcal{L}}$ for an ensemble of multi-layer perceptrons. We compare empirically to the results from finite-width networks for three example tasks: MNIST classification, CIFAR classification, and calorimeter energy regression.
arXiv Detail & Related papers (2025-03-07T21:15:11Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks [8.716913598251386]
We find a similar differential-equation-based characterization for two types of unshaped networks.
We derive the first order correction to the layerwise correlation.
These results together provide a connection between shaped and unshaped network architectures.
arXiv Detail & Related papers (2023-10-18T16:15:10Z) - Depth Dependence of $\mu$P Learning Rates in ReLU MLPs [72.14317069090407]
We study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate.
We find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
arXiv Detail & Related papers (2023-05-13T01:10:49Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Universal Approximation Properties for an ODENet and a ResNet: Mathematical Analysis and Numerical Experiments [0.0]
We prove a universal approximation property (UAP) for a class of ODENet and a class of ResNet.
We use this to construct a learning algorithm for ODENet.
arXiv Detail & Related papers (2020-12-22T06:04:09Z) - Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training.
For the classical $L^2$-regularized empirical risk minimization problem, we show that the training error is at most of the order $\mathcal{O}\left(\frac{1}{T}\right)$.
In the setting of $\ell^p$-distance losses, we prove that both the training error and the optimal parameters are at most of the order $\mathcal{O}(e^{-\mu
arXiv Detail & Related papers (2020-08-06T07:33:17Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}\big(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}\big)$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.