Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics
- URL: http://arxiv.org/abs/2512.21075v1
- Date: Wed, 24 Dec 2025 09:39:04 GMT
- Title: Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics
- Authors: Zihan Yao, Ruoyu Wu, Tianxiang Gao
- Abstract summary: We show that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth.
- Score: 9.885471525709113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The empirical success of deep learning is often attributed to scaling laws that predict consistent gains as model, data, and compute grow; however, large models can exhibit training instability and diminishing returns, suggesting that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth. While muP characterizes feature-learning dynamics in the infinite-width limit and enables hyperparameter transfer across width, its depth extension (depth-muP) breaks down for residual blocks with more than one internal layer. We derive Neural Feature Dynamics (NFD) for ResNets with single-layer residual blocks, characterizing feature learning via a coupled forward-backward stochastic system in the joint infinite-width and infinite-depth limit. In this regime, NFD identifies when scaling-law trends persist and explains diminishing returns. It also reveals a vanishing mechanism induced by the 1/sqrt(depth) residual scaling under which the gradient-independence assumption (GIA), known to fail during training at finite depth, becomes provably valid again at infinite depth, yielding an analytically tractable regime for end-to-end feature learning. Motivated by this insight, we study two-layer residual blocks and show that the same mechanism causes feature-learning collapse in the first internal layer at large depth, providing a structural explanation for the empirical failure of depth-muP. Based on this diagnosis, we propose a depth-aware learning-rate correction that counteracts the collapse and empirically restores depth-wise hyperparameter transfer, yielding stronger performance in deeper ResNets.
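The two mechanisms the abstract centers on, the 1/sqrt(depth) scaling of residual branches and a depth-aware learning-rate correction for two-layer blocks, can be made concrete in code. The PyTorch sketch below is illustrative only: the abstract does not state the exact correction rule, so the sqrt(depth) boost applied to each block's first internal layer is an assumption, not the paper's formula.

```python
import math
import torch
import torch.nn as nn

class TwoLayerResidualBlock(nn.Module):
    """Residual block with two internal layers; the branch is scaled by
    1/sqrt(depth), the scaling under which the abstract reports
    feature-learning collapse in the first internal layer."""
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.fc1 = nn.Linear(width, width)  # first internal layer (collapse-prone at large depth)
        self.fc2 = nn.Linear(width, width)
        self.scale = 1.0 / math.sqrt(depth)

    def forward(self, x):
        return x + self.scale * self.fc2(torch.relu(self.fc1(x)))

def depth_aware_param_groups(blocks, base_lr, depth):
    """Hypothetical depth-aware learning-rate correction: boost the first
    internal layer of each block so its feature updates do not vanish as
    depth grows. The sqrt(depth) factor is an illustrative assumption."""
    groups = []
    for block in blocks:
        groups.append({"params": block.fc1.parameters(), "lr": base_lr * math.sqrt(depth)})
        groups.append({"params": block.fc2.parameters(), "lr": base_lr})
    return groups

depth, width = 64, 256
blocks = nn.ModuleList(TwoLayerResidualBlock(width, depth) for _ in range(depth))
optimizer = torch.optim.SGD(depth_aware_param_groups(blocks, base_lr=0.1, depth=depth), lr=0.1)
```

Built this way, the first internal layer of every block receives a learning rate that grows with depth, which is the qualitative behavior the abstract describes: counteracting the depth-induced collapse of feature learning in that layer.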
Related papers
- ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling [57.91760520589592]
Scaling network depth has been a central driver behind the success of modern foundation models. This paper revisits the default mechanism for deepening neural networks, namely residual connections. We introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data (a minimal gating sketch appears after this list).
arXiv Detail & Related papers (2026-02-09T18:54:18Z) - Asymptotic analysis of shallow and deep forgetting in replay with Neural Collapse [32.34050220649143]
A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry, mitigating shallow forgetting typically requires substantially larger buffers.
arXiv Detail & Related papers (2025-12-08T10:35:57Z) - Step by Step Network [56.413861208019576]
Scaling up network depth is a fundamental pursuit in neural architecture design. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. We propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance.
arXiv Detail & Related papers (2025-11-18T10:35:49Z) - Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning [57.3885832382455]
We show that introducing static network sparsity alone can unlock further scaling potential beyond dense counterparts with state-of-the-art architectures. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve higher parameter efficiency for network expressivity.
arXiv Detail & Related papers (2025-06-20T17:54:24Z) - Optimal Depth of Neural Networks [2.1756081703276]
This paper introduces a formal theoretical framework for determining the optimal depth of a neural network. We model the layer-by-layer evolution of hidden representations as a sequential decision process. We propose a novel and practical regularization term, $\mathcal{L}_{\rm depth}$, that encourages the network to learn representations amenable to efficient, early exiting.
arXiv Detail & Related papers (2025-06-20T09:26:01Z) - Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers [33.441694038617506]
We prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross-entropy or mean squared error loss are approximately collapsed. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
arXiv Detail & Related papers (2025-05-21T08:16:03Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained by stochastic gradient descent (SGD) within the tensor program framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - Feature Learning Beyond the Edge of Stability [8.430481660019451]
We propose a homogeneous multilayer perceptron parameterization with a hidden-layer width pattern and analyze its training dynamics under gradient descent. We obtain formulas for the first three Taylor coefficients of the minibatch loss during training that illuminate the connection between sharpness and feature learning.
arXiv Detail & Related papers (2025-02-18T18:23:33Z) - Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
arXiv Detail & Related papers (2023-07-20T13:34:11Z) - Convergence and Implicit Regularization Properties of Gradient Descent
for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z) - Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group [4.56877715768796]
We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We solve exactly for the network properties following supervised learning, using an equilibrium Gibbs distribution in weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
arXiv Detail & Related papers (2020-12-07T20:08:31Z)
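As noted in the ANCRe entry above, residual connectivity itself can be parameterized and learned from data. A minimal sketch of that idea follows, again in PyTorch; the scalar gate per block, initialized at the identity shortcut, is an assumption chosen for simplicity and is not claimed to be ANCRe's actual reassignment mechanism.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose shortcut strength is learned from data,
    a minimal stand-in for learnable residual connectivity."""
    def __init__(self, width: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )
        # Learnable gate, initialized to 1.0 so training starts from the
        # standard identity shortcut (hypothetical parameterization).
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.gate * x + self.body(x)
```

A network stacked from such blocks can then decide, per block, how strongly to rely on the shortcut versus the transformed branch, which is the spirit of learning residual connectivities from data.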