Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
- URL: http://arxiv.org/abs/2510.04327v1
- Date: Sun, 05 Oct 2025 19:22:50 GMT
- Title: Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
- Authors: Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin
- Abstract summary: Arithmetic-Mean $\mu$P constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N\gg k$.
- Score: 9.94514344279733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Choosing an appropriate learning rate remains a key challenge in scaling depth of modern deep networks. The classical maximal update parameterization ($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P (AM-$\mu$P), which constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. Combined with a residual-aware He fan-in initialization - scaling residual-branch weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$) - AM-$\mu$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical LR principle for convolutional and deep residual networks without additional tuning overhead.
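To make the recipe concrete, here is a minimal PyTorch-style sketch (not the authors' released code) of the two ingredients described in the abstract: the residual-aware He fan-in initialization $\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$ and zero-shot learning-rate transfer via $\eta^\star(L)\propto L^{-3/2}$. The helper names, the default constant $c=2$, and the base depth and learning rate below are illustrative assumptions.

```python
import math
import torch.nn as nn

def init_residual_branch(conv: nn.Conv2d, num_blocks: int, c: float = 2.0) -> None:
    # Residual-aware He fan-in init: Var[W] = c / (K * fan_in), where K is the
    # number of residual blocks (c = 2.0 is an illustrative He-style constant).
    fan_in = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
    std = math.sqrt(c / (num_blocks * fan_in))
    nn.init.normal_(conv.weight, mean=0.0, std=std)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)

def transfer_lr(eta_base: float, depth_base: int, depth_target: int) -> float:
    # Zero-shot learning-rate transfer under the depth law eta*(L) ∝ L^{-3/2}.
    return eta_base * (depth_target / depth_base) ** (-1.5)

# Example: a rate tuned once at depth 20 is rescaled for a depth-80 model.
eta_80 = transfer_lr(eta_base=0.1, depth_base=20, depth_target=80)  # 0.1 * 4**-1.5 = 0.0125
```

Under this convention only the constant $c$ and a single base learning rate are tuned once; deeper variants reuse them through the $L^{-3/2}$ rescaling, matching the zero-shot transfer claim above.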
Related papers
- Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters a weight-decay-governed steady state. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Our results extend $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by the parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z) - The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams [15.246178589173523]
We study the gradient-based training of large-depth residual networks (ResNets). We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics.
arXiv Detail & Related papers (2025-09-12T11:51:44Z) - Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling [50.994194925685434]
LrcSSM is a non-linear recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing the Jacobian matrix to be diagonal, the full sequence can be solved in parallel. LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 do not provide.
arXiv Detail & Related papers (2025-05-27T20:02:59Z) - Emergence and scaling laws in SGD learning of shallow neural networks [64.48316762675141]
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective.
arXiv Detail & Related papers (2025-04-28T16:58:55Z) - Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks [0.0]
We show that any dataset with $N$ distinct points in $\mathbb{R}^d$ and $M$ output classes can be exactly classified. We also prove a universal approximation theorem in $L^p(\Omega; \mathbb{R}^m)$ for any bounded domain $\Omega$. Our results offer a unified and interpretable framework connecting controllability, expressivity, and training dynamics in deep neural networks.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping [15.975065054204753]
We study nonparametric regression using an over-parameterized two-layer neural network trained with algorithmic guarantees. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks [8.716913598251386]
We find similar differential equation based characterization for two types of unshaped networks.
We derive the first order correction to the layerwise correlation.
These results together provide a connection between shaped and unshaped network architectures.
arXiv Detail & Related papers (2023-10-18T16:15:10Z) - Depth Dependence of $\mu$P Learning Rates in ReLU MLPs [72.14317069090407]
We study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate.
We find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
arXiv Detail & Related papers (2023-05-13T01:10:49Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Training Multi-Layer Over-Parametrized Neural Network in Subquadratic
Time [12.348083977777833]
We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function.
In this work, we show how to reduce the training cost per iteration.
arXiv Detail & Related papers (2021-12-14T18:13:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.