Where You Place the Norm Matters: From Prejudiced to Neutral Initializations
- URL: http://arxiv.org/abs/2505.11312v3
- Date: Tue, 27 May 2025 06:51:38 GMT
- Title: Where You Place the Norm Matters: From Prejudiced to Neutral Initializations
- Authors: Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi
- Abstract summary: Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks. We study how the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. Our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
- Score: 5.070645558119592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks, widely adopted to improve training stability and generalization. While their practical effectiveness is well documented, a detailed theoretical understanding of how normalization affects model behavior, starting from initialization, remains an important open question. In this work, we investigate how both the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. In particular, we study how these choices shape the distribution of class predictions at initialization, which can range from unbiased (Neutral) to highly concentrated (Prejudiced) toward a subset of classes. Our analysis shows that normalization placement induces systematic differences in the initial prediction behavior of neural networks, which in turn shape the dynamics of learning. By linking architectural choices to prediction statistics at initialization, our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
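The abstract's central quantity, the distribution of class predictions at initialization, can be probed empirically. Below is a minimal numpy sketch (an illustration, not the paper's actual setup or architecture): it passes random inputs through randomly initialized ReLU MLPs, with and without a LayerNorm-style normalization after each hidden layer, and reports the average fraction of inputs assigned to the single most-predicted class. A fraction near 1/n_classes indicates a Neutral initialization; a fraction near 1 indicates a Prejudiced one. All function names and hyperparameters here are illustrative choices.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample's features to zero mean, unit variance
    # (a LayerNorm-style operation without learned scale/shift).
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)

def predictions_at_init(normalize, n_inputs=500, dim=100, width=128,
                        n_classes=10, depth=4, seed=0):
    """Argmax class predictions of one randomly initialized ReLU MLP."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_inputs, dim))
    h, in_dim = x, dim
    for _ in range(depth):
        # He-style scaling keeps activation magnitudes roughly stable.
        W = rng.standard_normal((in_dim, width)) * np.sqrt(2.0 / in_dim)
        h = np.maximum(h @ W, 0.0)  # ReLU
        if normalize:
            h = layer_norm(h)       # normalization placed after the activation
        in_dim = width
    W_out = rng.standard_normal((in_dim, n_classes)) / np.sqrt(in_dim)
    return (h @ W_out).argmax(axis=1)

def max_class_fraction(normalize, n_nets=10, n_classes=10):
    # Average, over independent random networks, of the fraction of
    # inputs assigned to the most-predicted class.
    fracs = []
    for seed in range(n_nets):
        preds = predictions_at_init(normalize, n_classes=n_classes, seed=seed)
        counts = np.bincount(preds, minlength=n_classes)
        fracs.append(counts.max() / preds.size)
    return float(np.mean(fracs))

if __name__ == "__main__":
    print("without normalization:", max_class_fraction(False))
    print("with normalization:   ", max_class_fraction(True))
```

Comparing the two printed fractions for different normalization placements (before vs. after the activation, or only at certain depths) gives a quick empirical handle on the Neutral-vs-Prejudiced distinction the paper analyzes theoretically.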
Related papers
- When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability [0.07767214588770123]
Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. In untrained DNNs, large regions of the input space are assigned to a single class. In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories.
arXiv Detail & Related papers (2025-05-17T17:31:56Z) - On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z) - Initial Guessing Bias: How Untrained Networks Favor Some Classes [0.09103230894909536]
We show that the structure of a deep neural network (DNN) can condition the model to assign all predictions to the same class, even before the beginning of training.
We prove that, besides dataset properties, the presence of this phenomenon is influenced by model choices including dataset preprocessing methods.
We highlight theoretical consequences, such as the breakdown of node-permutation symmetry and the violation of self-averaging.
arXiv Detail & Related papers (2023-06-01T15:37:32Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Łojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Normalization Techniques in Training DNNs: Methodology, Analysis and Application [111.82265258916397]
Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs).
This paper reviews and comments on the past, present and future of normalization methods in the context of training.
arXiv Detail & Related papers (2020-09-27T13:06:52Z) - Learning from Failure: Training Debiased Classifier from Biased Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme by training a pair of neural networks simultaneously.
Our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-07-06T07:20:29Z) - Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the optimization landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - Topologically Densified Distributions [25.140319008330167]
We study regularization in the context of small sample-size learning with over-parameterized neural networks.
We impose a topological constraint on samples drawn from the probability measure induced in that space.
This provably leads to mass concentration effects around the representations of training instances.
arXiv Detail & Related papers (2020-02-12T05:25:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.