Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
- URL: http://arxiv.org/abs/2508.12834v1
- Date: Mon, 18 Aug 2025 11:18:12 GMT
- Title: Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
- Authors: Hiroshi Horii, Sothea Has
- Abstract summary: Stochastic gradient descent (SGD) is one of the most fundamental optimization algorithms in machine learning (ML). We study the relationship between the quasi-stationary distribution derived from the Fokker-Planck equation and the initial distribution through the Kullback-Leibler (KL) divergence. We experimentally confirm our theoretical results by using the classical SGD to train fully connected neural networks on the MNIST and Fashion-MNIST datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initial distribution through the Kullback-Leibler (KL) divergence. As the quasi-stationary distribution depends on the expected cost function, the KL divergence eventually reveals the connection between the expected cost function and the initialization distribution. By applying this to deep neural network (DNN) models, we can express the bounds of the expected loss function explicitly in terms of the initialization parameters. Then, by minimizing this bound, we obtain an optimal condition for the initialization variance in the Gaussian case. This result provides a concrete mathematical criterion, rather than a heuristic, for selecting the scale of weight initialization in DNNs. In addition, we experimentally confirm our theoretical results by using classical SGD to train fully connected neural networks on the MNIST and Fashion-MNIST datasets. The results show that if the variance of the initialization distribution satisfies our theoretical optimal condition, the corresponding DNN model always achieves lower final training loss and higher test accuracy than with the conventional He-normal initialization. Our work thus supplies a mathematically grounded indicator that guides the choice of initialization variance and clarifies the physical meaning of the parameter dynamics in DNNs.
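The comparison against He-normal suggests a simple experimental harness. Below is a minimal PyTorch sketch of the two initialization schemes for a fully connected MNIST-style classifier; `sigma2_opt` is a hypothetical placeholder, since the paper's optimal value comes from minimizing its KL-based bound and is not reproduced here.

```python
import torch.nn as nn

def init_gaussian(model, sigma2):
    """Draw every Linear weight from N(0, sigma2); zero the biases."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=sigma2 ** 0.5)
            nn.init.zeros_(m.bias)

def init_he_normal(model):
    """Conventional He-normal baseline: Var(W) = 2 / fan_in per layer."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
sigma2_opt = 1e-2   # hypothetical value; the paper derives it from its KL-based bound
init_gaussian(model, sigma2_opt)
```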
Related papers
- From Shallow Bayesian Neural Networks to Gaussian Processes: General Convergence, Identifiability and Scalable Inference [0.0]
We study scaling limits of shallow Bayesian neural networks (BNNs) via their connection to Gaussian processes (GPs). We first establish a general convergence result from BNNs to GPs by relaxing assumptions used in prior formulations, and we compare alternative parameterizations of the limiting GP model. We characterize key properties including positive definiteness and both strict and practical identifiability under different input designs. For computation, we develop a scalable maximum a posteriori (MAP) training and prediction procedure using a Nyström approximation, and we show how the Nyström rank and anchor selection control the cost-accuracy trade-off.
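The Nyström step can be illustrated concretely. Here is a minimal NumPy sketch of the rank-m approximation K ≈ K_nm K_mm^{-1} K_mn, assuming an RBF kernel and uniformly sampled anchors (the paper's anchor-selection rules are not reproduced here):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between row sets A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def nystrom_approx(X, anchors, jitter=1e-6):
    """Rank-m Nystrom approximation K ~= K_nm K_mm^{-1} K_mn."""
    K_nm = rbf_kernel(X, anchors)
    K_mm = rbf_kernel(anchors, anchors) + jitter * np.eye(len(anchors))
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
anchors = X[rng.choice(len(X), size=50, replace=False)]  # uniform anchor choice
K_approx = nystrom_approx(X, anchors)
```

Increasing the anchor count m tightens the approximation at roughly O(n m²) cost, which is the cost-accuracy trade-off the abstract refers to.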
arXiv Detail & Related papers (2026-02-26T00:02:54Z) - The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradox: the more deterministic and structured the time series, the more severe the bias induced by point-wise loss functions. We present a concrete solution that simultaneously achieves both principles via the DFT or DWT.
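The abstract names the DFT/DWT remedy without spelling out the construction. One common way to depart from a point-wise loss is to compare series in the frequency domain, as in this hedged NumPy sketch (the weighting scheme is illustrative, not the paper's):

```python
import numpy as np

def weighted_dft_loss(pred, target, weights):
    """Frequency-domain loss. With uniform weights, Parseval's theorem makes
    this proportional to point-wise MSE, so non-uniform per-frequency weights
    are what actually change the optimization geometry."""
    P, T = np.fft.fft(pred), np.fft.fft(target)
    return np.mean(weights * np.abs(P - T) ** 2)

t = np.linspace(0.0, 1.0, 256, endpoint=False)
target = np.sin(2 * np.pi * 5 * t)
pred = np.sin(2 * np.pi * 5 * t + 0.1)
weights = 1.0 / (1.0 + np.arange(256))   # illustrative: emphasize low frequencies
loss = weighted_dft_loss(pred, target, weights)
```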
arXiv Detail & Related papers (2025-12-21T06:08:22Z) - When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability [0.07767214588770123]
Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. In untrained DNNs, large regions of the input space are assigned to a single class, a phenomenon known as initial guessing bias (IGB). In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories.
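The single-class phenomenon is easy to observe directly: push random inputs through an untrained network and inspect the histogram of predicted classes. A minimal PyTorch sketch (architecture and input scale are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                    nn.Linear(512, 512), nn.ReLU(),
                    nn.Linear(512, 10))

with torch.no_grad():
    x = torch.randn(10_000, 784)        # random inputs, no training at all
    guesses = net(x).argmax(dim=1)      # predicted class per input
    freq = torch.bincount(guesses, minlength=10).float() / len(x)
print(freq)  # often far from the uniform 0.1 per class: initial guessing bias
```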
arXiv Detail & Related papers (2025-05-17T17:31:56Z) - Chaos into Order: Neural Framework for Expected Value Estimation of Stochastic Partial Differential Equations [0.9944647907864256]
We propose a physics-informed neural framework designed to approximate the expected value of linear stochastic partial differential equations (SPDEs). By leveraging randomized sampling of both space-time coordinates and noise realizations during training, LEC trains standard feedforward neural networks to minimize a residual loss across multiple samples. We show that the model consistently learns accurate approximations of the expected value of the solution in lower dimensions, with a predictable decrease in robustness as the spatial dimension grows.
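As a rough illustration of "randomized sampling of space-time coordinates and noise realizations", here is a hedged PyTorch sketch of a residual loss for a toy additive-noise heat equation u_t = u_xx + σξ; this is not the paper's LEC formulation, and initial/boundary terms are omitted:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

def residual_loss(n_pts=256, n_noise=8, sigma=0.1):
    """Sample fresh (x, t) collocation points and noise draws each call."""
    xt = torch.rand(n_pts, 2, requires_grad=True)   # (x, t) in the unit square
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    loss = 0.0
    for _ in range(n_noise):                        # average over noise realizations
        xi = sigma * torch.randn_like(u)
        loss = loss + ((u_t - u_xx - xi) ** 2).mean()
    return loss / n_noise

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss = residual_loss()
loss.backward()
opt.step()
```

Since E[ξ] = 0, the noise terms average out across samples, and for additive noise the minimizer tracks the deterministic dynamics, which coincides with the expected solution.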
arXiv Detail & Related papers (2025-02-05T23:27:28Z) - Neural variational Data Assimilation with Uncertainty Quantification using SPDE priors [28.804041716140194]
Recent advances in the deep learning community make it possible to address this problem through a neural architecture embedded in a variational data assimilation framework.<n>In this work we use the theory of stochastic partial differential equations (SPDEs) and Gaussian processes (GPs) to estimate both the space and time covariance of the state.
arXiv Detail & Related papers (2024-02-02T19:18:12Z) - Enhancing Data-Assimilation in CFD using Graph Neural Networks [0.0]
We present a novel machine learning approach for data assimilation applied in fluid mechanics, based on adjoint optimization augmented by Graph Neural Network (GNN) models.
We obtain our results using direct numerical simulations based on a Finite Element Method (FEM) solver; a two-fold interface between the GNN model and the solver allows the GNN's predictions to be incorporated into post-processing steps of the FEM analysis.
arXiv Detail & Related papers (2023-11-29T19:11:40Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
We propose a new method for approximating the linearized Laplace approximation (LLA) using a variational sparse Gaussian process (GP).
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
arXiv Detail & Related papers (2023-02-24T10:32:30Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
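For reference, the momentum-SGD baseline's Nesterov update fits in a few lines. This is plain NAG in PyTorch's nesterov=True formulation, offered as a hedged sketch; it is not the paper's semi-implicit Gauss-Seidel variant:

```python
import torch

class PlainNAG:
    """Classical Nesterov accelerated gradient (PyTorch's nesterov=True form).
    NAG-GS additionally applies a semi-implicit Gauss-Seidel splitting,
    which is not reproduced here."""
    def __init__(self, params, lr=1e-2, mu=0.9):
        self.params = [p for p in params]
        self.lr, self.mu = lr, mu
        self.v = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, v in zip(self.params, self.v):
            if p.grad is None:
                continue
            v.mul_(self.mu).add_(p.grad)                # v <- mu*v + g
            p.sub_(self.lr * (p.grad + self.mu * v))    # p <- p - lr*(g + mu*v)
```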
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of a stochastic optimization algorithm can be bounded by the 'complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Improving predictions of Bayesian neural nets via local linearization [79.21517734364093]
We argue that the Gauss-Newton approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN).
Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one.
We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation.
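The "GLM predictive" idea, predicting with the linearized model rather than the original network, can be sketched for a scalar output: the predictive is Gaussian with mean f(x; θ*) and variance J(x) Σ J(x)^T, where Σ is the Laplace posterior covariance. A hedged PyTorch sketch follows; the isotropic Σ is a placeholder, not the paper's Laplace covariance:

```python
import torch
import torch.nn as nn

def glm_predictive(model, x, Sigma):
    """Mean and variance of the linearized ('GLM') predictive at one input x."""
    out = model(x).squeeze()                       # f(x; theta*), scalar here
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(out, params)
    J = torch.cat([g.reshape(-1) for g in grads])  # Jacobian row d f / d theta
    return out.detach(), J @ Sigma @ J             # mean, J Sigma J^T

model = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))
P = sum(p.numel() for p in model.parameters())
Sigma = 1e-2 * torch.eye(P)                        # placeholder posterior covariance
mean, var = glm_predictive(model, torch.randn(1, 3), Sigma)
```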
arXiv Detail & Related papers (2020-08-19T12:35:55Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
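Tracking a Hessian norm in practice typically relies on Hessian-vector products rather than materializing the matrix. A minimal sketch using power iteration (Pearlmutter's trick); the paper's layer-wise estimator is not reproduced here:

```python
import torch

def hessian_spectral_norm(loss, params, iters=20):
    """Estimate ||H||_2 for the loss Hessian w.r.t. params via power iteration
    on Hessian-vector products, without forming H explicitly."""
    params = list(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    norm = torch.tensor(0.0)
    for _ in range(iters):
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        Hv = torch.autograd.grad(gv, params, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
        v = [h / (norm + 1e-12) for h in Hv]
    return norm.item()

model = torch.nn.Linear(10, 1)
loss = (model(torch.randn(32, 10)) ** 2).mean()
print(hessian_spectral_norm(loss, model.parameters()))
```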
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.