A Kernel Perspective of Skip Connections in Convolutional Networks
- URL: http://arxiv.org/abs/2211.14810v1
- Date: Sun, 27 Nov 2022 12:25:54 GMT
- Title: A Kernel Perspective of Skip Connections in Convolutional Networks
- Authors: Daniel Barzilai, Amnon Geifman, Meirav Galun and Ronen Basri
- Abstract summary: We study the properties of ResNets through their Gaussian Process and Neural Tangent kernels.
Our results indicate that with ReLU activation, eigenvalues of these residual kernels decay at a similar rate compared to the same kernels when skip connections are not used.
Our analysis further shows that the matrices obtained by these residual kernels yield favorable condition numbers at finite depths.
- Score: 21.458906138864176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over-parameterized residual networks (ResNets) are amongst the most
successful convolutional neural architectures for image processing. Here we
study their properties through their Gaussian Process and Neural Tangent
kernels. We derive explicit formulas for these kernels, analyze their spectra,
and provide bounds on their implied condition numbers. Our results indicate
that (1) with ReLU activation, the eigenvalues of these residual kernels decay
polynomially at a similar rate compared to the same kernels when skip
connections are not used, thus maintaining a similar frequency bias; (2)
however, residual kernels are more locally biased. Our analysis further shows
that the matrices obtained by these residual kernels yield more favorable condition
numbers at finite depths than those obtained without the skip connections,
thereby enabling faster convergence of training with gradient descent.
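The link between condition number and convergence speed stated at the end of the abstract can be illustrated with a minimal sketch. It does not use the paper's residual kernels: the two matrices below are synthetic symmetric positive-definite matrices with hand-picked spectra, and the helper names `make_spd` and `gd_steps_to_fit` are hypothetical. The sketch assumes the standard kernel-regime dynamics, in which the training residual evolves as r_{t+1} = (I - eta K) r_t, so with eta = 1 / lambda_max the slowest mode contracts by a factor of 1 - 1/kappa per step.

```python
import numpy as np


def make_spd(eigenvalues, rng):
    """Build a symmetric positive-definite matrix with the given spectrum.

    Synthetic stand-in for a kernel Gram matrix (an assumption of this
    sketch, not one of the paper's kernels).
    """
    n = len(eigenvalues)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    return (q * eigenvalues) @ q.T


def gd_steps_to_fit(K, y, tol=1e-6, max_steps=1_000_000):
    """Count gradient descent steps until the relative residual drops below tol.

    Predictions follow u <- u - eta * K (u - y), the discrete-time
    kernel-regime dynamics, with step size eta = 1 / lambda_max.
    """
    eta = 1.0 / np.linalg.eigvalsh(K)[-1]  # eigvalsh returns ascending order
    u = np.zeros_like(y)
    target_norm = np.linalg.norm(y)
    for step in range(1, max_steps + 1):
        u = u - eta * (K @ (u - y))
        if np.linalg.norm(u - y) <= tol * target_norm:
            return step
    return max_steps


rng = np.random.default_rng(0)
n = 20
y = rng.standard_normal(n)

K_well = make_spd(np.linspace(1.0, 10.0, n), rng)   # condition number 10
K_ill = make_spd(np.linspace(0.01, 10.0, n), rng)   # condition number 1000

steps_well = gd_steps_to_fit(K_well, y)
steps_ill = gd_steps_to_fit(K_ill, y)

print("condition numbers:", np.linalg.cond(K_well), np.linalg.cond(K_ill))
print("steps to fit:", steps_well, steps_ill)
```

Both matrices share the same largest eigenvalue, so the step size is identical and only the smallest eigenvalue differs. The better-conditioned matrix should need far fewer steps, which is the mechanism the abstract appeals to when it relates favorable condition numbers to faster training.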
Related papers
- Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel [55.82768375605861]
We establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity for kernel methods.
Unlike static kernels such as NTK, the LPK captures the entire training trajectory, adapting to both data and optimization dynamics.
arXiv Detail & Related papers (2025-06-12T23:17:09Z) - On the Convergence of Irregular Sampling in Reproducing Kernel Hilbert Spaces [0.0]
We discuss approximation properties of kernel regression under minimalistic assumptions on both the kernel and the input data.
We first prove error estimates in the kernel's RKHS norm.
This leads to new results concerning uniform convergence of kernel regression on compact domains.
arXiv Detail & Related papers (2025-04-18T10:57:16Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram
Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
arXiv Detail & Related papers (2023-05-25T15:32:21Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - An Empirical Analysis of the Laplace and Neural Tangent Kernels [0.0]
The neural tangent kernel is a kernel function defined over the parameter distribution of an infinite width neural network.
We show that the Laplace kernel and neural tangent kernel share the same reproducing kernel Hilbert space in the space of $\mathbb{S}^{d-1}$.
arXiv Detail & Related papers (2022-08-07T16:18:02Z) - Neural Networks as Kernel Learners: The Silent Alignment Effect [86.44610122423994]
Neural networks in the lazy training regime converge to kernel machines.
We show that this can indeed happen due to a phenomenon we term silent alignment.
We also demonstrate that non-whitened data can weaken the silent alignment effect.
arXiv Detail & Related papers (2021-10-29T18:22:46Z) - Spectral Analysis of the Neural Tangent Kernel for Deep Residual
Networks [29.67334658659187]
We show that the eigenfunctions of ResNTK are the spherical harmonics and the eigenvalues decay polynomially with frequency $k$ as $k^{-d}$.
We show, by drawing on the analogy to the Laplace kernel, that depending on the choice of a hyper-parameter that balances between the skip and residual connections, ResNTK can either become spiky with depth, as with FC-NTK, or maintain a stable shape.
arXiv Detail & Related papers (2021-04-07T12:35:19Z) - Deep Equals Shallow for ReLU Networks in Kernel Regimes [13.909388235627791]
We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their shallow two-layer counterpart.
Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function.
arXiv Detail & Related papers (2020-09-30T02:37:43Z) - On the Similarity between the Laplace and Neural Tangent Kernels [26.371904197642145]
We show that NTK for fully connected networks is closely related to the standard Laplace kernel.
Our results suggest that much insight about neural networks can be obtained from analysis of the well-known Laplace kernel.
arXiv Detail & Related papers (2020-07-03T09:48:23Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite
Networks [12.692279981822011]
We derive the covariance functions of multi-layer perceptrons with exponential linear units (ELU) and Gaussian error linear units (GELU).
We analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions.
We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics.
arXiv Detail & Related papers (2020-02-20T01:25:39Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.