Spectra of the Conjugate Kernel and Neural Tangent Kernel for
linear-width neural networks
- URL: http://arxiv.org/abs/2005.11879v3
- Date: Sat, 10 Oct 2020 16:47:46 GMT
- Title: Spectra of the Conjugate Kernel and Neural Tangent Kernel for
linear-width neural networks
- Authors: Zhou Fan and Zhichao Wang
- Abstract summary: We study the eigenvalue distributions of the Conjugate Kernel (CK) and Neural Tangent Kernel (NTK) associated to multi-layer feedforward neural networks.
We show that the eigenvalue distributions of the CK and NTK converge to deterministic limits.
- Score: 22.57374777395746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the eigenvalue distributions of the Conjugate Kernel and Neural
Tangent Kernel associated to multi-layer feedforward neural networks. In an
asymptotic regime where network width is increasing linearly in sample size,
under random initialization of the weights, and for input samples satisfying a
notion of approximate pairwise orthogonality, we show that the eigenvalue
distributions of the CK and NTK converge to deterministic limits. The limit for
the CK is described by iterating the Marcenko-Pastur map across the hidden
layers. The limit for the NTK is equivalent to that of a linear combination of
the CK matrices across layers, and may be described by recursive fixed-point
equations that extend this Marcenko-Pastur map. We demonstrate the agreement of
these asymptotic predictions with the observed spectra for both synthetic and
CIFAR-10 training data, and we perform a small simulation to investigate the
evolutions of these spectra over training.
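As an illustrative sanity check (a minimal sketch, not the authors' code), the snippet below simulates the simplest instance of this limit: with a single hidden layer, linear activation, i.i.d. Gaussian weights, and exactly orthogonal inputs, the CK reduces to a Wishart matrix, so in the linear-width regime its empirical spectrum should match the classical Marcenko-Pastur density. The sample size n, hidden width d1, and seed are arbitrary choices; the nonlinear, multi-layer limit obtained by iterating the Marcenko-Pastur map is not implemented here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal illustration (not the paper's code): one hidden layer, linear
# activation, orthogonal inputs.  In that case the post-activation features
# reduce to an i.i.d. Gaussian matrix Y, and the CK  (1/d1) Y^T Y  is a
# Wishart matrix whose spectrum follows the Marcenko-Pastur law.
rng = np.random.default_rng(0)
n, d1 = 1000, 2000                      # sample size and hidden width (arbitrary)
gamma = n / d1                          # linear-width aspect ratio

Y = rng.standard_normal((d1, n))        # stands in for the hidden-layer features
ck_eigs = np.linalg.eigvalsh(Y.T @ Y / d1)

# Classical Marcenko-Pastur density with ratio gamma (here gamma <= 1).
lam_lo, lam_hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
x = np.linspace(lam_lo, lam_hi, 400)
mp_density = np.sqrt((lam_hi - x) * (x - lam_lo)) / (2 * np.pi * gamma * x)

plt.hist(ck_eigs, bins=60, density=True, alpha=0.5, label="empirical CK spectrum")
plt.plot(x, mp_density, "r", label="Marcenko-Pastur density")
plt.xlabel("eigenvalue")
plt.ylabel("density")
plt.legend()
plt.show()
```

Running this overlays the eigenvalue histogram of the simulated CK on the Marcenko-Pastur density with ratio n/d1; the two should agree closely for matrices of this size.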
Related papers
- Nonlinear spiked covariance matrices and signal propagation in deep
neural networks [22.84097371842279]
We study the eigenvalue spectrum of the Conjugate Kernel defined by a nonlinear feature map of a feedforward neural network.
In this work, we characterize these signal eigenvalues and eigenvectors for a nonlinear version of the spiked covariance model.
We also study a simple regime of representation learning where the weight matrix develops a rank-one signal component over training.
arXiv Detail & Related papers (2024-02-15T17:31:19Z) - An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network [10.384951432591492]
Recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks.
We show that this infinite-width analysis can be extended to the Jacobian of a deep neural network.
We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of the kernel-regression solution to gain insight into Jacobian regularisation.
arXiv Detail & Related papers (2023-12-06T09:52:18Z) - Neural Tangent Kernels Motivate Graph Neural Networks with
Cross-Covariance Graphs [94.44374472696272]
We investigate NTKs and alignment in the context of graph neural networks (GNNs).
Our results establish theoretical guarantees on the optimality of the alignment for a two-layer GNN.
These guarantees are characterized by the graph shift operator being a function of the cross-covariance between the input and the output data.
arXiv Detail & Related papers (2023-10-16T19:54:21Z) - Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Deterministic equivalent of the Conjugate Kernel matrix associated to
Artificial Neural Networks [0.0]
We show that the empirical spectral distribution of the Conjugate Kernel converges to a deterministic limit.
More precisely we obtain a deterministic equivalent for its Stieltjes transform and its resolvent, with quantitative bounds involving both the dimension and the spectral parameter.
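For readers unfamiliar with the object in this entry: the Stieltjes transform of an empirical spectral distribution is an averaged resolvent trace, and the spectral density can be recovered from its imaginary part just above the real axis. The sketch below (my illustration, not the paper's deterministic equivalent) computes the empirical Stieltjes transform of a toy Gram matrix; the matrix sizes, grid, and smoothing parameter eta are arbitrary choices.

```python
import numpy as np

# Empirical Stieltjes transform  m_n(z) = (1/n) * tr((K - z I)^{-1})
# of a symmetric matrix K, evaluated just above the real axis.  By Stieltjes
# inversion, Im(m_n(x + i*eta)) / pi approximates the spectral density at x.
def empirical_stieltjes(K: np.ndarray, x_grid: np.ndarray, eta: float = 1e-2) -> np.ndarray:
    eigs = np.linalg.eigvalsh(K)                      # eigenvalues of K
    z = x_grid + 1j * eta                             # points in the upper half-plane
    return np.mean(1.0 / (eigs[None, :] - z[:, None]), axis=1)

# Example: a toy n x n Gram matrix of random features (sizes are arbitrary).
rng = np.random.default_rng(0)
n, d = 500, 1000
Y = rng.standard_normal((d, n))
K = Y.T @ Y / d

x_grid = np.linspace(0.0, 3.5, 200)
m = empirical_stieltjes(K, x_grid)
density = m.imag / np.pi                              # approximate spectral density
print(density.max())
```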
arXiv Detail & Related papers (2023-06-09T12:31:59Z) - Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram
Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
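A minimal sketch of the Gram-iteration idea from this entry, applied to a plain dense matrix rather than a convolutional layer (the paper's circulant/Fourier treatment of convolutions is omitted, and the function name and iteration count below are my own choices). Squaring the Gram matrix repeatedly squares the singular values, so the Frobenius norm of the iterate yields a rapidly tightening upper bound on the spectral norm.

```python
import numpy as np

def gram_iteration_bound(W: np.ndarray, n_iter: int = 6) -> float:
    """Upper bound on the spectral norm sigma_1(W) via Gram iteration (sketch).

    Iterating G <- G @ G on G_1 = W^T W gives G_k = (W^T W)^(2^(k-1)), whose
    largest eigenvalue is sigma_1(W)^(2^k).  Since the Frobenius norm dominates
    the spectral norm, sigma_1(W) <= ||G_k||_F ** (1 / 2^k).  Rescaling each
    iterate keeps the entries from overflowing; the scale is tracked in logs.
    """
    G = W.T @ W                                  # G_1 (symmetric PSD)
    log_scale = 0.0                              # log of the accumulated rescaling
    for _ in range(n_iter - 1):
        fro = np.linalg.norm(G)                  # Frobenius norm of the iterate
        G = G / fro                              # rescale to avoid overflow
        log_scale = 2.0 * (log_scale + np.log(fro))
        G = G @ G                                # square the (rescaled) Gram matrix
    log_bound = (log_scale + np.log(np.linalg.norm(G))) / 2.0 ** n_iter
    return float(np.exp(log_bound))

# Usage: the bound approaches the true spectral norm within a few iterations.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
print(gram_iteration_bound(W, n_iter=6), np.linalg.norm(W, 2))
```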
arXiv Detail & Related papers (2023-05-25T15:32:21Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Double-descent curves in neural networks: a new perspective using
Gaussian processes [9.153116600213641]
Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, grows after reaching an optimal number of parameters, and then descends again in the overparameterised regime.
We use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel.
arXiv Detail & Related papers (2021-02-14T20:31:49Z)