Related papers: When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability

When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability

URL: http://arxiv.org/abs/2505.12096v2
Date: Mon, 26 May 2025 07:33:58 GMT
Title: When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability
Authors: Alberto Bassi, Carlo Albert, Aurelien Lucchi, Marco Baity-Jesi, Emanuele Francazi,
Abstract summary: Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly networks dictates whether gradients vanish or explode.<n>In untrained DNNs, large regions of the input space are assigned to a single class.<n>In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories.
Score: 0.07767214588770123
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Concurrently, untrained DNNs were found to exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories, thereby connecting a network prejudice toward specific classes with the conditions for fast and accurate learning. This connection yields the counter-intuitive conclusion: the initialization that optimizes trainability is necessarily biased, rather than neutral. Furthermore, we extend the MF/IGB framework to multi-node activation functions, offering practical guidelines for designing initialization schemes that ensure stable optimization in architectures employing max- and average-pooling layers.

Related papers

Deep Neural Network-Driven Adaptive Filtering [42.10078151647835]
This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF)<n>In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition.<n>The framework is validated by extensive numerical experiments across a spectrum of non-Gaussian scenarios.
arXiv Detail & Related papers (2025-08-06T09:42:40Z)
Initial Guessing Bias: How Untrained Networks Favor Some Classes [0.09103230894909536]
We show that the structure of a deep neural network (DNN) can condition the model to assign all predictions to the same class, even before the beginning of training. We prove that, besides dataset properties, the presence of this phenomenon is influenced by model choices including dataset preprocessing methods. We highlight theoretical consequences, such as the breakdown of node-permutation symmetry and the violation of self-averaging.
arXiv Detail & Related papers (2023-06-01T15:37:32Z)
Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification. Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime. We show that the neural networks possess a different limiting kernel which we call textitbias-generalized NTK We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics. We then exploit higher-order statistics only later during training. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
Robust Learning of Parsimonious Deep Neural Networks [0.0]
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network. We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection. We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures.
arXiv Detail & Related papers (2022-05-10T03:38:55Z)
On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow. We show that the squared loss converges exponentially to its optimum. We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
CSNE: Conditional Signed Network Embedding [77.54225346953069]
Signed networks encode positive and negative relations between entities such as friend/foe or trust/distrust. Existing embedding methods for sign prediction generally enforce different notions of status or balance theories in their optimization function. We introduce conditional signed network embedding (CSNE) Our probabilistic approach models structural information about the signs in the network separately from fine-grained detail.
arXiv Detail & Related papers (2020-05-19T19:14:52Z)
Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix. Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability. Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network. Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks. The use of gradient combined nonvolutionity renders learning susceptible to novel problems. We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.