Quantitative convergence of trained single layer neural networks to Gaussian processes
- URL: http://arxiv.org/abs/2509.24544v2
- Date: Thu, 23 Oct 2025 10:27:48 GMT
- Title: Quantitative convergence of trained single layer neural networks to Gaussian processes
- Authors: Eloy Mosig, Andrea Agazzi, Dario Trevisan,
- Abstract summary: We study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit.<n>We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time.<n>Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
- Score: 3.441021278275805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
Related papers
- Layer-wise Quantization for Quantized Optimistic Dual Averaging [75.4148236967503]
We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to the heterogeneities over the course of training.<n>We propose a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs.
arXiv Detail & Related papers (2025-05-20T13:53:58Z) - Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z) - Wide Deep Neural Networks with Gaussian Weights are Very Close to
Gaussian Processes [1.0878040851638]
We show that the distance between the network output and the corresponding Gaussian approximation scales inversely with the width of the network, exhibiting faster convergence than the naive suggested by the central limit theorem.
We also apply our bounds to obtain theoretical approximations for the exact posterior distribution of the network, when the likelihood is a bounded Lipschitz function of the network output evaluated on a (finite) training set.
arXiv Detail & Related papers (2023-12-18T22:29:40Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Proximal Mean Field Learning in Shallow Neural Networks [0.4972323953932129]
We propose a custom learning algorithm for shallow neural networks with single hidden layer having infinite width.
We realize mean field learning as a computational algorithm, rather than as an analytical tool.
Our algorithm performs gradient descent of the free energy associated with the risk functional.
arXiv Detail & Related papers (2022-10-25T10:06:26Z) - Deep Architecture Connectivity Matters for Its Convergence: A
Fine-Grained Analysis [94.64007376939735]
We theoretically characterize the impact of connectivity patterns on the convergence of deep neural networks (DNNs) under gradient descent training.
We show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate.
arXiv Detail & Related papers (2022-05-11T17:43:54Z) - Convergence and Implicit Regularization Properties of Gradient Descent
for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admits a scaling limit which is H"older continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural kernel (NTK)
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Bayesian Deep Ensembles via the Neural Tangent Kernel [49.569912265882124]
We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK)
We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member.
We prove that our Bayesian deep ensembles make more conservative predictions than standard deep ensembles in the infinite width limit.
arXiv Detail & Related papers (2020-07-11T22:10:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.