Related papers: SGD Distributional Dynamics of Three Layer Neural Networks

SGD Distributional Dynamics of Three Layer Neural Networks

URL: http://arxiv.org/abs/2012.15036v1
Date: Wed, 30 Dec 2020 04:37:09 GMT
Title: SGD Distributional Dynamics of Three Layer Neural Networks
Authors: Victor Luo, Yazhen Wang and Glenn Fung
Abstract summary: In paper, we seek to extend the mean field results of Mei et al. from two neural networks with one hidden layer to three neural networks with two hidden layers. We will show that the SGD is captured by a set of non-linear differential equations, and prove that distributions of dynamics in the two layers are independent.
Score: 7.025709586759655
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the rise of big data analytics, multi-layer neural networks have surfaced as one of the most powerful machine learning methods. However, their theoretical mathematical properties are still not fully understood. Training a neural network requires optimizing a non-convex objective function, typically done using stochastic gradient descent (SGD). In this paper, we seek to extend the mean field results of Mei et al. (2018) from two-layer neural networks with one hidden layer to three-layer neural networks with two hidden layers. We will show that the SGD dynamics is captured by a set of non-linear partial differential equations, and prove that the distributions of weights in the two hidden layers are independent. We will also detail exploratory work done based on simulation and real-world data.

Related papers

Statistical mechanics of extensive-width Bayesian neural networks near interpolation [4.976898227858662]
We study the supervised learning of a two-layer fully connected network with generic weight distribution and activation function.<n>We focus on the Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture.<n>Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases.
arXiv Detail & Related papers (2025-05-30T17:46:59Z)
Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks using the tensor gradient program (SGD) framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z)
An Analysis Framework for Understanding Deep Neural Networks Based on Network Dynamics [11.44947569206928]
Deep neural networks (DNNs) maximize information extraction by rationally allocating the proportion of neurons in different modes across deep layers. This framework provides a unified explanation for fundamental DNN behaviors such as the "flat minima effect," "grokking," and double descent phenomena.
arXiv Detail & Related papers (2025-01-05T04:23:21Z)
Enhancing lattice kinetic schemes for fluid dynamics with Lattice-Equivariant Neural Networks [79.16635054977068]
We present a new class of equivariant neural networks, dubbed Lattice-Equivariant Neural Networks (LENNs) Our approach develops within a recently introduced framework aimed at learning neural network-based surrogate models Lattice Boltzmann collision operators. Our work opens towards practical utilization of machine learning-augmented Lattice Boltzmann CFD in real-world simulations.
arXiv Detail & Related papers (2024-05-22T17:23:15Z)
Improved Generalization of Weight Space Networks via Augmentations [53.87011906358727]
Learning in deep weight spaces (DWS) is an emerging research direction, with applications to 2D and 3D neural fields (INRs, NeRFs) We empirically analyze the reasons for this overfitting and find that a key reason is the lack of diversity in DWS datasets. To address this, we explore strategies for data augmentation in weight spaces and propose a MixUp method adapted for weight spaces.
arXiv Detail & Related papers (2024-02-06T15:34:44Z)
Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes [3.686808512438363]
We extend the proof of Matthews et al. to a larger class of initial weight distributions. We show that fully-connected and convolutional networks with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training.
arXiv Detail & Related papers (2023-10-25T12:38:36Z)
Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers. This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
arXiv Detail & Related papers (2023-07-20T13:34:11Z)
Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks [49.808194368781095]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks. This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z)
Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions [9.936974568429173]
We consider two classes of target functions: generalized bandlimited functions and Sobolev-Type balls. Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons. These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.
arXiv Detail & Related papers (2023-01-11T17:57:33Z)
A Local Geometric Interpretation of Feature Extraction in Deep Feedforward Neural Networks [13.159994710917022]
In this paper, we present a local geometric analysis to interpret how deep feedforward neural networks extract low-dimensional features from high-dimensional data. Our study shows that, in a local geometric region, the optimal weight in one layer of the neural network and the optimal feature generated by the previous layer comprise a low-rank approximation of a matrix that is determined by the Bayes action of this layer.
arXiv Detail & Related papers (2022-02-09T18:50:00Z)
Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations. This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z)
How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution. Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.