Spectra of the Conjugate Kernel and Neural Tangent Kernel for
linear-width neural networks
- URL: http://arxiv.org/abs/2005.11879v3
- Date: Sat, 10 Oct 2020 16:47:46 GMT
- Title: Spectra of the Conjugate Kernel and Neural Tangent Kernel for
linear-width neural networks
- Authors: Zhou Fan and Zhichao Wang
- Abstract summary: We study the eigenvalue distributions of the Conjugate Kernel (CK) and Neural Tangent Kernel (NTK) associated to multi-layer feedforward neural networks.
We show that the eigenvalue distributions of the CK and NTK converge to deterministic limits.
- Score: 22.57374777395746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the eigenvalue distributions of the Conjugate Kernel and Neural
Tangent Kernel associated to multi-layer feedforward neural networks. In an
asymptotic regime where network width is increasing linearly in sample size,
under random initialization of the weights, and for input samples satisfying a
notion of approximate pairwise orthogonality, we show that the eigenvalue
distributions of the CK and NTK converge to deterministic limits. The limit for
the CK is described by iterating the Marcenko-Pastur map across the hidden
layers. The limit for the NTK is equivalent to that of a linear combination of
the CK matrices across layers, and may be described by recursive fixed-point
equations that extend this Marcenko-Pastur map. We demonstrate the agreement of
these asymptotic predictions with the observed spectra for both synthetic and
CIFAR-10 training data, and we perform a small simulation to investigate the
evolutions of these spectra over training.
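As an illustrative sanity check (a minimal sketch, not the authors' code), the snippet below simulates the simplest instance of this limit: with a single hidden layer, linear activation, i.i.d. Gaussian weights, and exactly orthogonal inputs, the CK reduces to a Wishart matrix, so in the linear-width regime its empirical spectrum should match the classical Marcenko-Pastur density. The sample size n, hidden width d1, and seed are arbitrary choices; the nonlinear, multi-layer limit obtained by iterating the Marcenko-Pastur map is not implemented here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal illustration (not the paper's code): one hidden layer, linear
# activation, orthogonal inputs.  In that case the post-activation features
# reduce to an i.i.d. Gaussian matrix Y, and the CK  (1/d1) Y^T Y  is a
# Wishart matrix whose spectrum follows the Marcenko-Pastur law.
rng = np.random.default_rng(0)
n, d1 = 1000, 2000                      # sample size and hidden width (arbitrary)
gamma = n / d1                          # linear-width aspect ratio

Y = rng.standard_normal((d1, n))        # stands in for the hidden-layer features
ck_eigs = np.linalg.eigvalsh(Y.T @ Y / d1)

# Classical Marcenko-Pastur density with ratio gamma (here gamma <= 1).
lam_lo, lam_hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
x = np.linspace(lam_lo, lam_hi, 400)
mp_density = np.sqrt((lam_hi - x) * (x - lam_lo)) / (2 * np.pi * gamma * x)

plt.hist(ck_eigs, bins=60, density=True, alpha=0.5, label="empirical CK spectrum")
plt.plot(x, mp_density, "r", label="Marcenko-Pastur density")
plt.xlabel("eigenvalue")
plt.ylabel("density")
plt.legend()
plt.show()
```

Running this overlays the eigenvalue histogram of the simulated CK on the Marcenko-Pastur density with ratio n/d1; the two should agree closely for matrices of this size.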
Related papers
- Nonlinear spiked covariance matrices and signal propagation in deep
neural networks [22.84097371842279]
We study the eigenvalue spectrum of the Conjugate Kernel defined by a nonlinear feature map of a feedforward neural network.
In this work, we characterize these signal eigenvalues and eigenvectors for a nonlinear version of the spiked covariance model.
We also study a simple regime of representation learning where the weight matrix develops a rank-one signal component over training.
arXiv Detail & Related papers (2024-02-15T17:31:19Z) - An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network [10.384951432591492]
Recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks.
We show that this infinite-width analysis can be extended to the Jacobian of a deep neural network.
We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of the kernel-regression solution to gain insight into Jacobian regularisation.
arXiv Detail & Related papers (2023-12-06T09:52:18Z) - Neural Tangent Kernels Motivate Graph Neural Networks with
Cross-Covariance Graphs [94.44374472696272]
We investigate NTKs and alignment in the context of graph neural networks (GNNs).
Our results establish theoretical guarantees on the optimality of the alignment for a two-layer GNN.
These guarantees are characterized by the graph shift operator being a function of the cross-covariance between the input and the output data.
arXiv Detail & Related papers (2023-10-16T19:54:21Z) - Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Deterministic equivalent of the Conjugate Kernel matrix associated to
Artificial Neural Networks [0.0]
We show that the empirical spectral distribution of the Conjugate Kernel converges to a deterministic limit.
More precisely we obtain a deterministic equivalent for its Stieltjes transform and its resolvent, with quantitative bounds involving both the dimension and the spectral parameter.
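For readers unfamiliar with the object in this entry: the Stieltjes transform of an empirical spectral distribution is an averaged resolvent trace, and the spectral density can be recovered from its imaginary part just above the real axis. The sketch below (my illustration, not the paper's deterministic equivalent) computes the empirical Stieltjes transform of a toy Gram matrix; the matrix sizes, grid, and smoothing parameter eta are arbitrary choices.

```python
import numpy as np

# Empirical Stieltjes transform  m_n(z) = (1/n) * tr((K - z I)^{-1})
# of a symmetric matrix K, evaluated just above the real axis.  By Stieltjes
# inversion, Im(m_n(x + i*eta)) / pi approximates the spectral density at x.
def empirical_stieltjes(K: np.ndarray, x_grid: np.ndarray, eta: float = 1e-2) -> np.ndarray:
    eigs = np.linalg.eigvalsh(K)                      # eigenvalues of K
    z = x_grid + 1j * eta                             # points in the upper half-plane
    return np.mean(1.0 / (eigs[None, :] - z[:, None]), axis=1)

# Example: a toy n x n Gram matrix of random features (sizes are arbitrary).
rng = np.random.default_rng(0)
n, d = 500, 1000
Y = rng.standard_normal((d, n))
K = Y.T @ Y / d

x_grid = np.linspace(0.0, 3.5, 200)
m = empirical_stieltjes(K, x_grid)
density = m.imag / np.pi                              # approximate spectral density
print(density.max())
```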
arXiv Detail & Related papers (2023-06-09T12:31:59Z) - Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram
Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
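A minimal sketch of the Gram-iteration idea from this entry, applied to a plain dense matrix rather than a convolutional layer (the paper's circulant/Fourier treatment of convolutions is omitted, and the function name and iteration count below are my own choices). Squaring the Gram matrix repeatedly squares the singular values, so the Frobenius norm of the iterate yields a rapidly tightening upper bound on the spectral norm.

```python
import numpy as np

def gram_iteration_bound(W: np.ndarray, n_iter: int = 6) -> float:
    """Upper bound on the spectral norm sigma_1(W) via Gram iteration (sketch).

    Iterating G <- G @ G on G_1 = W^T W gives G_k = (W^T W)^(2^(k-1)), whose
    largest eigenvalue is sigma_1(W)^(2^k).  Since the Frobenius norm dominates
    the spectral norm, sigma_1(W) <= ||G_k||_F ** (1 / 2^k).  Rescaling each
    iterate keeps the entries from overflowing; the scale is tracked in logs.
    """
    G = W.T @ W                                  # G_1 (symmetric PSD)
    log_scale = 0.0                              # log of the accumulated rescaling
    for _ in range(n_iter - 1):
        fro = np.linalg.norm(G)                  # Frobenius norm of the iterate
        G = G / fro                              # rescale to avoid overflow
        log_scale = 2.0 * (log_scale + np.log(fro))
        G = G @ G                                # square the (rescaled) Gram matrix
    log_bound = (log_scale + np.log(np.linalg.norm(G))) / 2.0 ** n_iter
    return float(np.exp(log_bound))

# Usage: the bound approaches the true spectral norm within a few iterations.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
print(gram_iteration_bound(W, n_iter=6), np.linalg.norm(W, 2))
```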
arXiv Detail & Related papers (2023-05-25T15:32:21Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Double-descent curves in neural networks: a new perspective using
Gaussian processes [9.153116600213641]
Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, grows after reaching an optimal number of parameters, and then descends again in the overparameterised regime.
We use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel.
arXiv Detail & Related papers (2021-02-14T20:31:49Z)