Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite
Networks
- URL: http://arxiv.org/abs/2002.08517v3
- Date: Mon, 1 Mar 2021 00:43:43 GMT
- Title: Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite
Networks
- Authors: Russell Tsuchida, Tim Pearce, Chris van der Heide, Fred Roosta, Marcus
Gallagher
- Abstract summary: We derive the covariance functions of multi-layer perceptrons with exponential linear units (ELU) and Gaussian error linear units (GELU).
We analyse the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions.
We find that unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics.
- Score: 12.692279981822011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analysing and computing with Gaussian processes arising from infinitely wide
neural networks has recently seen a resurgence in popularity. Despite this,
many explicit covariance functions of networks with activation functions used
in modern networks remain unknown. Furthermore, while the kernels of deep
networks can be computed iteratively, theoretical understanding of deep kernels
is lacking, particularly with respect to fixed-point dynamics. Firstly, we
derive the covariance functions of multi-layer perceptrons (MLPs) with
exponential linear units (ELU) and Gaussian error linear units (GELU) and
evaluate the performance of the limiting Gaussian processes on some benchmarks.
Secondly, and more generally, we analyse the fixed-point dynamics of iterated
kernels corresponding to a broad range of activation functions. We find that
unlike some previously studied neural network kernels, these new kernels
exhibit non-trivial fixed-point dynamics which are mirrored in finite-width
neural networks. The fixed point behaviour present in some networks explains a
mechanism for implicit regularisation in overparameterised deep models. Our
results relate to both the static iid parameter conjugate kernel and the
dynamic neural tangent kernel constructions. Software at
github.com/RussellTsuchida/ELU_GELU_kernels.
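The closed-form ELU and GELU covariance functions are given in the paper and implemented in the linked repository; the sketch below is an independent, minimal NumPy illustration of the objects the abstract refers to, not the authors' code. It estimates one step of the i.i.d.-parameter conjugate-kernel (NNGP) recursion by Monte Carlo and iterates it over depth, printing the normalised correlation whose fixed-point behaviour the paper analyses. The weight/bias variances, the two-input setup and the sample size are arbitrary choices for the example, and SciPy is assumed to be available for the error function.

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def elu(x, alpha=1.0):
    # Exponential linear unit.
    return np.where(x > 0.0, x, alpha * np.expm1(x))

def next_kernel(K, phi, sigma_w2=1.5, sigma_b2=0.1, n_mc=200_000, seed=0):
    # One Monte Carlo step of the conjugate-kernel recursion
    #   K^{l}(x, x') = sigma_b^2 + sigma_w^2 * E[phi(u) phi(v)],
    # where (u, v) are jointly Gaussian with covariance K^{l-1}.
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    L = np.linalg.cholesky(K + 1e-12 * np.eye(len(K)))   # jitter for numerical safety
    Z = phi(L @ rng.standard_normal((len(K), n_mc)))     # activations, shape (inputs, n_mc)
    return sigma_b2 + sigma_w2 * (Z @ Z.T) / n_mc

# Two inputs whose layer-0 kernel has unit variances and correlation rho.
rho = 0.5
K = np.array([[1.0, rho], [rho, 1.0]])

for layer in range(1, 11):
    K = next_kernel(K, gelu, seed=layer)                 # swap in `elu` to compare
    corr = K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
    print(f"layer {layer:2d}: variance {K[0, 0]:.3f}, correlation {corr:+.4f}")
```

Replacing the Monte Carlo expectation with the paper's closed-form GELU and ELU expressions makes the recursion exact; how the printed correlation drifts towards, or away from, a fixed value as depth grows is the kind of fixed-point dynamics the abstract describes.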
Related papers
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present an exact novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- Scalable Neural Network Kernels [22.299704296356836]
We introduce scalable neural network kernels (SNNKs), capable of approximating regular feedforward layers (FFLs).
We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures.
Our mechanism provides up to 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.
arXiv Detail & Related papers (2023-10-20T02:12:56Z)
- Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models [7.608408123113268]
We analyze approximate empirical neural tangent kernels (eNTK) for data attribution.
We introduce two new random projection variants of approximate eNTK which allow users to tune the time and memory complexity of their calculation (a generic sketch of the eNTK-plus-random-projection idea appears after this list).
We conclude that kernel machines using the approximate neural tangent kernel as the kernel function are effective surrogate models.
arXiv Detail & Related papers (2023-05-23T23:51:53Z)
- Approximation by non-symmetric networks for cross-domain learning [0.0]
We study the approximation capabilities of kernel-based networks using non-symmetric kernels.
We obtain estimates on the accuracy of uniform approximation of functions in a Sobolev class by ReLU$^r$ networks when $r$ is not necessarily an integer.
arXiv Detail & Related papers (2023-05-06T01:33:26Z)
- On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains [10.360517127652185]
We provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain.
This class of kernel functions includes, but is not limited to, the neural tangent kernel associated with neural networks of different depths and various activation functions.
arXiv Detail & Related papers (2023-05-04T08:54:40Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features needed to achieve comparable error bounds is much smaller than that of other baseline feature map constructions, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed [27.38015169185521]
We show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning.
We show how over-parametrising the neural network leads to faster convergence, but does not improve its final performance.
arXiv Detail & Related papers (2021-02-23T15:10:15Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
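As a generic illustration of the "Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models" entry above (my own construction, not that paper's method), the sketch below forms empirical-NTK features as per-example parameter gradients of a tiny, analytically differentiated network, builds the exact eNTK Gram matrix, and then compresses the same features with a plain Gaussian random projection; the architecture, dimensions and projection are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n, k = 8, 32, 5, 64            # input dim, hidden width, #examples, sketch size
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
w2 = rng.standard_normal(h) / np.sqrt(h)

def grad_flat(x):
    # Flattened gradient of the scalar output f(x) = w2 . tanh(W1 x)
    # with respect to all parameters (W1, w2).
    a = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1.0 - a ** 2), x)   # df/dW1, shape (h, d)
    dw2 = a                                  # df/dw2, shape (h,)
    return np.concatenate([dW1.ravel(), dw2])

X = rng.standard_normal((n, d))
J = np.stack([grad_flat(x) for x in X])      # per-example gradients, shape (n, p)

K_exact = J @ J.T                            # empirical NTK Gram matrix

# Gaussian random projection of the gradient features: E[R @ R.T] = I, so
# Z @ Z.T is an unbiased, lower-memory estimate of the eNTK.
R = rng.standard_normal((J.shape[1], k)) / np.sqrt(k)
Z = J @ R
K_sketch = Z @ Z.T

print("max |exact - sketched| entry:", np.abs(K_exact - K_sketch).max())
```

With k much smaller than the parameter count, the projected features take O(nk) memory rather than O(np), which is the time/memory trade-off that entry alludes to; the paper's own variants are presumably more refined than this plain Gaussian sketch.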