The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks
- URL: http://arxiv.org/abs/2511.00958v1
- Date: Sun, 02 Nov 2025 14:38:20 GMT
- Title: The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks
- Authors: Khoat Than
- Abstract summary: We develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs. In contrast, the insertion of normalization layers can provably reduce the Lipschitz constant at an exponential rate in the number of normalization operations.
- Score: 3.2356128177594363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when a DNN architecture uses many normalization layers. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting; there are uncountably many such ill-behaved DNNs. In contrast, the insertion of normalization layers can provably reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
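As a rough numerical companion to the abstract's central claim (a minimal sketch, not the paper's construction), the snippet below upper-bounds the input Lipschitz constant of a plain deep MLP by the product of its layer spectral norms and contrasts it with a local Jacobian-norm estimate for the same architecture with LayerNorm inserted. The depth, width, and helper function are illustrative assumptions.

```python
# Illustrative sketch only: compares a crude Lipschitz upper bound of a plain
# deep MLP (product of layer spectral norms) with a local input-Jacobian norm
# of the same architecture with LayerNorm inserted. Depth/width are arbitrary.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 128

def make_mlp(with_norm: bool) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if with_norm:
            layers.append(nn.LayerNorm(width))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

plain, normed = make_mlp(False), make_mlp(True)

# Unnormalized network: ReLU is 1-Lipschitz, so the product of the linear
# layers' spectral norms upper-bounds the input Lipschitz constant.
spectral_product = 1.0
for module in plain:
    if isinstance(module, nn.Linear):
        spectral_product *= torch.linalg.matrix_norm(module.weight, ord=2).item()
print(f"plain MLP: product of spectral norms = {spectral_product:.3e}")

# Normalized network: the product bound no longer applies as-is, so estimate
# the local sensitivity via the spectral norm of the input Jacobian at a point.
x = torch.randn(1, width)
jac = torch.autograd.functional.jacobian(lambda z: normed(z), x)
jac = jac.reshape(width, width)
print(f"LayerNorm MLP: local Jacobian spectral norm = "
      f"{torch.linalg.matrix_norm(jac, ord=2).item():.3e}")
```

The two printed quantities are not directly comparable bounds (one is a global upper bound, the other a single-point estimate); the script only illustrates how depth-wise products can grow geometrically and where normalization is inserted, not the paper's formal results.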
Related papers
- Precise gradient descent training dynamics for finite-width multi-layer neural networks [8.057006406834466]
We provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks. Our non-asymptotic state evolution theory captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights.
arXiv Detail & Related papers (2025-05-08T02:19:39Z) - A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff [57.25901375384457]
We propose a nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions. In particular, it does not require the boundedness of the loss function, as commonly assumed in the literature. We show the near-minimax optimality of our theory for multilayer ReLU networks on regression problems.
arXiv Detail & Related papers (2025-03-03T23:34:12Z) - Improving the Expressive Power of Deep Neural Networks through Integral Activation Transform [12.36064367319084]
We generalize the traditional fully connected deep neural network (DNN) through the concept of continuous width.
We show that IAT-ReLU exhibits a continuous activation pattern when continuous basis functions are employed.
Our numerical experiments demonstrate that IAT-ReLU outperforms regular ReLU in terms of trainability and smoothness.
arXiv Detail & Related papers (2023-12-19T20:23:33Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction [36.83448475700536]
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets.
This paper gives mathematical analysis and experiments suggesting that normalization encourages GD to reduce the sharpness of the loss surface.
arXiv Detail & Related papers (2022-06-14T18:19:05Z) - Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)