Transition to Linearity of General Neural Networks with Directed Acyclic
Graph Architecture
- URL: http://arxiv.org/abs/2205.11786v2
- Date: Wed, 7 Jun 2023 22:20:05 GMT
- Title: Transition to Linearity of General Neural Networks with Directed Acyclic
Graph Architecture
- Authors: Libin Zhu, Chaoyue Liu, Mikhail Belkin
- Abstract summary: We show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity.
Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.
- Score: 20.44438519046223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we show that feedforward neural networks corresponding to
arbitrary directed acyclic graphs undergo transition to linearity as their
"width" approaches infinity. The width of these general networks is
characterized by the minimum in-degree of their neurons, except for the input
and first layers. Our results identify the mathematical structure underlying
transition to linearity and generalize a number of recent works aimed at
characterizing transition to linearity or constancy of the Neural Tangent
Kernel for standard architectures.
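The claim above can be illustrated numerically. Below is a minimal sketch (our own illustration, not code from the paper), using JAX and a two-layer fully-connected tanh network, which is one special case of the DAG architectures the paper covers: it measures how far the network output deviates from its first-order Taylor expansion at initialization under a parameter displacement of fixed norm, a gap that transition to linearity predicts should shrink as the width grows. The function names (`init_params`, `linearization_gap`) and all hyperparameters are illustrative assumptions, not the paper's construction.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, d_in, width):
    k1, k2 = jax.random.split(key)
    # Standard Gaussian weights; the 1/sqrt(fan-in) factors live in the forward pass.
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (width,))}

def f(params, x):
    # Two-layer tanh network in NTK-style parameterization; a fully-connected
    # chain is one special case of the DAG networks considered in the paper.
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def linearization_gap(key, width, d_in=16, radius=1.0):
    kp, kx, kd = jax.random.split(key, 3)
    params = init_params(kp, d_in, width)
    x = jax.random.normal(kx, (d_in,))

    flat0, unravel = ravel_pytree(params)
    direction = jax.random.normal(kd, flat0.shape)
    delta = radius * direction / jnp.linalg.norm(direction)   # fixed-norm displacement

    g = lambda w: f(unravel(w), x)
    f0, df = jax.jvp(g, (flat0,), (delta,))       # value and directional derivative at init
    return jnp.abs(g(flat0 + delta) - (f0 + df))  # deviation from the tangent (linear) model

if __name__ == "__main__":
    for width in [64, 256, 1024, 4096]:
        keys = jax.random.split(jax.random.PRNGKey(width), 10)
        gaps = jnp.stack([linearization_gap(k, width) for k in keys])
        print(width, float(gaps.mean()))          # expected to shrink roughly like 1/sqrt(width)
```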
Related papers
- Curve Your Attention: Mixed-Curvature Transformers for Graph
Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces.
We also provide a kernelized approach to non-Euclidean attention, which enables our model to run with time and memory cost linear in the number of nodes and edges.
arXiv Detail & Related papers (2023-09-08T02:44:37Z) - A Derivation of Feedforward Neural Network Gradients Using Fréchet
Calculus [0.0]
We present a derivation of the gradients of feedforward neural networks using Fréchet calculus.
We show how our analysis generalizes to more general neural network architectures including, but not limited to, convolutional networks.
arXiv Detail & Related papers (2022-09-27T08:14:00Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Transition to Linearity of Wide Neural Networks is an Emerging Property
of Assembling Weak Models [20.44438519046223]
Wide neural networks with a linear output layer have been shown to be near-linear and to have a near-constant neural tangent kernel (NTK).
We show that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse "weak" sub-models, none of which dominate the assembly.
arXiv Detail & Related papers (2022-03-10T01:27:01Z) - Hyperbolic Graph Neural Networks: A Review of Methods and Applications [55.5502008501764]
Graph neural networks generalize conventional neural networks to graph-structured data.
The performance of Euclidean models in graph-related learning is still bounded and limited by the representation ability of Euclidean geometry.
Recently, hyperbolic space has gained increasing popularity in processing graph data with tree-like structure and power-law distribution.
arXiv Detail & Related papers (2022-02-28T15:08:48Z) - Sheaf Neural Networks [2.4366811507669124]
We present a generalization of graph convolutional networks by generalizing the diffusion operation underlying this class of networks.
We show that the resulting sheaf neural networks can outperform graph convolutional networks in domains where relations between nodes are asymmetric and signed.
arXiv Detail & Related papers (2020-12-08T01:19:48Z) - A Unifying View on Implicit Bias in Training Linear Neural Networks [31.65006970108761]
We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training.
We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases.
arXiv Detail & Related papers (2020-10-06T06:08:35Z) - Data-Driven Learning of Geometric Scattering Networks [74.3283600072357]
We propose a new graph neural network (GNN) module based on relaxations of recently proposed geometric scattering transforms.
Our learnable geometric scattering (LEGS) module enables adaptive tuning of the wavelets to encourage band-pass features to emerge in learned representations.
arXiv Detail & Related papers (2020-10-06T01:20:27Z) - On the linearity of large non-linear models: when and why the tangent
kernel is constant [20.44438519046223]
We shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity.
We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network (a rough empirical check of this is sketched after this list).
arXiv Detail & Related papers (2020-10-02T16:44:45Z) - Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric
graphs [81.12344211998635]
A common approach to defining convolutions on meshes is to interpret them as graphs and apply graph convolutional networks (GCNs).
We propose Gauge Equivariant Mesh CNNs which generalize GCNs to apply anisotropic gauge equivariant kernels.
Our experiments validate the significantly improved expressivity of the proposed model over conventional GCNs and other methods.
arXiv Detail & Related papers (2020-03-11T17:21:15Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
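As referenced above, the entry on "On the linearity of large non-linear models" attributes NTK constancy to the scaling of the Hessian norm with width. The sketch below (again our own illustration under assumed names such as `ntk_entry` and `ntk_drift`, not code from that paper) computes one empirical tangent-kernel entry of the same toy two-layer tanh network at initialization and after a fixed-norm parameter displacement; the relative change is expected to shrink as the width grows.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def f(params, x):
    # Same toy two-layer tanh network as in the earlier sketch.
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["W2"] @ h / jnp.sqrt(h.shape[0])

def ntk_entry(flat, unravel, x1, x2):
    # Empirical tangent kernel entry: <df/dw(x1), df/dw(x2)> at parameters `flat`.
    grad_at = lambda x: jax.grad(lambda w: f(unravel(w), x))(flat)
    return jnp.dot(grad_at(x1), grad_at(x2))

def ntk_drift(key, width, d_in=16, radius=1.0):
    kp1, kp2, kx1, kx2, kd = jax.random.split(key, 5)
    params = {"W1": jax.random.normal(kp1, (width, d_in)),
              "W2": jax.random.normal(kp2, (width,))}
    x1 = jax.random.normal(kx1, (d_in,))
    x2 = jax.random.normal(kx2, (d_in,))

    flat0, unravel = ravel_pytree(params)
    direction = jax.random.normal(kd, flat0.shape)
    delta = radius * direction / jnp.linalg.norm(direction)  # fixed-norm parameter move

    k_init = ntk_entry(flat0, unravel, x1, x2)
    k_moved = ntk_entry(flat0 + delta, unravel, x1, x2)
    return jnp.abs(k_moved - k_init) / jnp.abs(k_init)       # relative change of the kernel entry

if __name__ == "__main__":
    for width in [64, 256, 1024, 4096]:
        drift = ntk_drift(jax.random.PRNGKey(0), width)
        print(width, float(drift))                           # expected to shrink with width
```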
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.