On the validity of kernel approximations for orthogonally-initialized
neural networks
- URL: http://arxiv.org/abs/2104.05878v1
- Date: Tue, 13 Apr 2021 00:57:39 GMT
- Title: On the validity of kernel approximations for orthogonally-initialized
neural networks
- Authors: James Martens
- Abstract summary: We extend kernel function approximation results for neural networks with Gaussian-distributed weights to single-layer networks initialized with Haar-distributed random orthogonal matrices (with possible rescaling).
- Score: 14.23089477635398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this note we extend kernel function approximation results for neural
networks with Gaussian-distributed weights to single-layer networks initialized
using Haar-distributed random orthogonal matrices (with possible rescaling).
This is accomplished using recent results from random matrix theory.
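A minimal numerical sketch of the claim, assuming a single wide ReLU layer: the empirical kernel obtained with rescaled Haar-orthogonal weights (built here by one common construction, QR of a Gaussian matrix with sign correction, cropped and rescaled) tracks the one obtained with Gaussian weights. The widths, input dimension, and nonlinearity are illustrative choices, and the note's actual argument is analytic, via random matrix theory, not a Monte-Carlo check.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_weights(m, d):
    return rng.normal(size=(m, d)) / np.sqrt(d)

def orthogonal_weights(m, d):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix (with the
    # standard sign correction), cropped to m x d and rescaled so that the
    # pre-activation variance matches the Gaussian case.
    n = max(m, d)
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))
    return Q[:m, :d] * np.sqrt(n / d)

def empirical_kernel(x1, x2, sampler, width=2048, trials=20):
    # Monte-Carlo estimate of E[ relu(W x1) . relu(W x2) ] / width
    vals = []
    for _ in range(trials):
        W = sampler(width, x1.shape[0])
        vals.append(np.maximum(W @ x1, 0) @ np.maximum(W @ x2, 0) / width)
    return np.mean(vals)

x1, x2 = rng.normal(size=16), rng.normal(size=16)
print("Gaussian init  :", empirical_kernel(x1, x2, gaussian_weights))
print("orthogonal init:", empirical_kernel(x1, x2, orthogonal_weights))
```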
Related papers
- Genus expansion for non-linear random matrix ensembles with applications to neural networks [3.801509221714223]
We present a unified approach to studying certain non-linear random matrix ensembles and associated random neural networks. We use a novel series expansion for neural networks which generalizes Faà di Bruno's formula to an arbitrary number of compositions. As an application, we prove several results about neural networks with random weights.
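For context, a sketch of the kind of non-linear ensemble such expansions describe: the empirical spectrum (here summarized by its first moments, which is what genus/moment expansions compute) of the feature Gram matrix of a one-layer random network. The sizes and the tanh nonlinearity are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Empirical spectrum of the conjugate-kernel matrix M = (1/n1) f(WX)^T f(WX)
# for a one-layer random network with Gaussian weights W and Gaussian data X.
rng = np.random.default_rng(0)
n0, n1, m = 800, 800, 1200           # input dim, width, number of samples
X = rng.normal(size=(n0, m))
W = rng.normal(size=(n1, n0)) / np.sqrt(n0)
Y = np.tanh(W @ X)                   # post-activation features
M = Y.T @ Y / n1                     # empirical feature Gram matrix
eigs = np.linalg.eigvalsh(M)
print("first spectral moments:", [round(np.mean(eigs**k), 4) for k in (1, 2, 3)])
```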
arXiv Detail & Related papers (2024-07-11T12:58:07Z)
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural networks and their training, applicable to neural networks of arbitrary width, depth and topology.
We also present an exact, novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
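As background for the representor viewpoint, a sketch of generic kernel machinery (not the paper's LeNK construction): a predictor written as a kernel expansion over the training points, with the expansion coefficients fitted by unregularized gradient descent on the squared loss. The RBF kernel, data sizes, and step-size rule are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0])

K = rbf_kernel(X, X)
alpha = np.zeros(50)
lr = 1.0 / np.linalg.eigvalsh(K)[-1] ** 2     # safe step size for the quadratic loss
for _ in range(500):
    alpha -= lr * (K @ (K @ alpha - y))       # unregularized gradient descent
f_train = K @ alpha                           # predictions are kernel expansions
print("train MSE:", np.mean((f_train - y) ** 2))
```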
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- An Exact Kernel Equivalence for Finite Classification Models [1.4777718769290527]
We compare our exact representation to the well-known Neural Tangent Kernel (NTK) and discuss approximation error relative to the NTK.
We use this exact kernel to show that our theoretical contribution can provide useful insights into the predictions made by neural networks.
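The NTK referred to here is, at finite width, the inner product of parameter gradients at two inputs. A minimal sketch for a one-hidden-layer ReLU network, with hand-written gradients; the width, input dimension, and activation are assumptions made only for illustration, not the paper's setup.

```python
import numpy as np

# Empirical (finite-width) NTK of f(x) = v . relu(W x) / sqrt(m).
rng = np.random.default_rng(0)
d, m = 5, 2000
W = rng.normal(size=(m, d))
v = rng.normal(size=m)

def grads(x):
    pre = W @ x
    act = np.maximum(pre, 0.0)
    dv = act / np.sqrt(m)                                     # gradient w.r.t. v
    dW = ((pre > 0) * v)[:, None] * x[None, :] / np.sqrt(m)   # gradient w.r.t. W
    return dv, dW

def ntk(x1, x2):
    dv1, dW1 = grads(x1)
    dv2, dW2 = grads(x2)
    return dv1 @ dv2 + np.sum(dW1 * dW2)

x, xp = rng.normal(size=d), rng.normal(size=d)
print("empirical NTK(x, x'):", ntk(x, xp))
```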
arXiv Detail & Related papers (2023-08-01T20:22:53Z)
- Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
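A minimal sketch of the linearized-Laplace idea, assuming a toy regression model with cubic features (linear in its parameters, so the linearization is exact here): the Hessian is approximated by a Gauss-Newton term plus the prior precision, and the predictive variance of the linearized model is J(x) H^{-1} J(x)^T. The model, noise_prec, and prior_prec values are stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):                        # toy model: f(x, theta) = features(x) @ theta
    return np.stack([x, x**2, x**3], -1)

theta_map = np.array([1.0, -0.5, 0.1])                 # assumed MAP estimate
x_train = rng.uniform(-2, 2, size=40)
J_train = features(x_train)                            # parameter Jacobian at MAP
noise_prec, prior_prec = 25.0, 1.0
H = noise_prec * J_train.T @ J_train + prior_prec * np.eye(3)

x_test = np.linspace(-3, 3, 5)
J_test = features(x_test)
pred_var = np.einsum('ip,pq,iq->i', J_test, np.linalg.inv(H), J_test)
print("predictive std:", np.sqrt(pred_var))            # grows outside the data range
```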
arXiv Detail & Related papers (2023-04-17T14:23:43Z)
- Simple initialization and parametrization of sinusoidal networks via their kernel bandwidth [92.25666446274188]
Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions.
We first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis.
We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth.
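A sketch of the bandwidth effect with a single SIREN-style sinusoidal layer: larger frequency scale omega0 makes the empirical feature kernel decay faster in input space, i.e., the induced kernel passes higher frequencies. The initialization constants and the lag-5 diagnostic are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def sinusoidal_features(x, width=256, omega0=30.0, in_dim=1,
                        rng=np.random.default_rng(0)):
    # Features sin(omega0 * (W x + b)); omega0 acts as a bandwidth knob.
    W = rng.uniform(-1.0, 1.0, size=(width, in_dim)) / in_dim
    b = rng.uniform(-np.pi, np.pi, size=width)
    return np.sin(omega0 * (x @ W.T + b))

x = np.linspace(-1, 1, 200)[:, None]
for omega0 in (5.0, 30.0):
    F = sinusoidal_features(x, omega0=omega0)
    K = F @ F.T / F.shape[1]                    # empirical feature kernel
    print(f"omega0={omega0}: normalized kernel at lag 5 grid points =",
          round(K[100, 105] / K[100, 100], 3))  # smaller value = faster decay
```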
arXiv Detail & Related papers (2022-11-26T07:41:48Z)
- Coordinate descent on the orthogonal group for recurrent neural network training [9.886326127330337]
We show that the algorithm rotates two columns of the recurrent matrix, an operation that can be efficiently implemented as a multiplication by a Givens matrix.
Experiments on a benchmark recurrent neural network training problem are presented to demonstrate the effectiveness of the proposed algorithm.
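A sketch of the elementary update described above: a Givens rotation applied on the right mixes only two columns of the recurrent matrix, so each coordinate-descent step costs O(n) and preserves orthogonality. The angle selection rule of the paper is not reproduced; theta below is an arbitrary stand-in.

```python
import numpy as np

def apply_givens_right(W, i, j, theta):
    # Multiply W by a Givens rotation acting on columns i and j.
    c, s = np.cos(theta), np.sin(theta)
    Wi, Wj = W[:, i].copy(), W[:, j].copy()
    W[:, i] = c * Wi - s * Wj
    W[:, j] = s * Wi + c * Wj
    return W

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))      # start from an orthogonal matrix
Q = apply_givens_right(Q, 1, 4, theta=0.3)
print("still orthogonal:", np.allclose(Q.T @ Q, np.eye(6)))
```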
arXiv Detail & Related papers (2021-07-30T19:27:11Z)
- Double-descent curves in neural networks: a new perspective using Gaussian processes [9.153116600213641]
Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters.
We use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process kernel.
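A small numerical sketch of the width dependence: the eigenvalue spectrum of the empirical feature kernel of a random ReLU layer drifts away from a very wide "proxy" for the infinite-width kernel as the width shrinks. The sizes, the ReLU nonlinearity, and the drift metric are illustrative assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                                 # input dim, number of samples
X = rng.normal(size=(n, d)) / np.sqrt(d)       # roughly unit-norm inputs

def feature_gram(width):
    W = rng.normal(size=(d, width))
    F = np.maximum(X @ W, 0.0)
    return F @ F.T / width                     # n x n empirical kernel matrix

K_wide = feature_gram(20000)                   # proxy for the infinite-width kernel
for width in (50, 200, 1000):
    K = feature_gram(width)
    drift = np.linalg.norm(np.linalg.eigvalsh(K) - np.linalg.eigvalsh(K_wide))
    print(f"width={width}: eigenvalue drift from wide proxy = {drift:.3f}")
```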
arXiv Detail & Related papers (2021-02-14T20:31:49Z)
- Generalized Leverage Score Sampling for Neural Networks [82.95180314408205]
Leverage score sampling is a powerful technique that originates from theoretical computer science.
In this work, we generalize the results in [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17] to a broader class of kernels.
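For reference, the basic primitive such results build on: statistical leverage scores of a tall matrix, i.e., the diagonal of the projection onto its column space, and row sampling proportional to those scores. The matrix here is arbitrary random data, purely for illustration.

```python
import numpy as np

def leverage_scores(A):
    # Diagonal of A (A^T A)^{-1} A^T, computed via a thin QR factorization.
    Q, _ = np.linalg.qr(A, mode='reduced')
    return np.sum(Q ** 2, axis=1)            # squared row norms of Q

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
scores = leverage_scores(A)
probs = scores / scores.sum()
idx = rng.choice(len(A), size=100, replace=True, p=probs)   # leverage-score sampling
print("scores sum to the rank:", round(scores.sum(), 2))    # equals 20 here
```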
arXiv Detail & Related papers (2020-09-21T14:46:01Z)
- RicciNets: Curvature-guided Pruning of High-performance Neural Networks Using Ricci Flow [0.0]
We use the definition of Ricci curvature to remove edges of low importance before mapping the computational graph to a neural network.
We show a reduction of almost 35% in the number of floating-point operations (FLOPs) per pass, with no degradation in performance.
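A sketch of curvature-guided edge pruning on a toy graph. As a simple combinatorial stand-in for the paper's Ricci curvature, it uses the basic Forman curvature of an unweighted edge (u, v), 4 - deg(u) - deg(v); the graph and the median threshold are arbitrary choices for illustration.

```python
import numpy as np

def forman_curvature(edges, n_nodes):
    # Basic Forman curvature per edge; edges between high-degree nodes score lowest.
    deg = np.zeros(n_nodes, dtype=int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {(u, v): 4 - deg[u] - deg[v] for u, v in edges}

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
curv = forman_curvature(edges, n_nodes=5)
threshold = np.median(list(curv.values()))
kept = [e for e in edges if curv[e] >= threshold]   # prune low-curvature edges
print("kept edges:", kept)
```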
arXiv Detail & Related papers (2020-07-08T15:56:02Z)
- Tractable Approximate Gaussian Inference for Bayesian Neural Networks [1.933681537640272]
We propose an analytical method for performing tractable approximate Gaussian inference (TAGI) in Bayesian neural networks.
The method has a computational complexity of $\mathcal{O}(n)$ with respect to the number of parameters $n$, and tests performed on regression and classification benchmarks confirm that, for the same network architecture, it matches the performance of existing methods relying on gradient backpropagation.
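A sketch of the kind of closed-form moment update that analytical Gaussian inference builds on, for a single linear unit with independent Gaussian weights and inputs. This single-unit illustration is an assumption-laden simplification, not the paper's full TAGI scheme.

```python
import numpy as np

def linear_unit_moments(mu_w, var_w, mu_x, var_x, mu_b=0.0, var_b=0.0):
    # For independent Gaussians w and x, the product wx has
    #   mean mu_w * mu_x,  var var_w*var_x + var_w*mu_x**2 + var_x*mu_w**2;
    # sums of independent terms add their means and variances.
    mu_z = np.sum(mu_w * mu_x) + mu_b
    var_z = np.sum(var_w * var_x + var_w * mu_x**2 + var_x * mu_w**2) + var_b
    return mu_z, var_z

mu_w, var_w = np.array([0.5, -0.2]), np.array([0.1, 0.1])
mu_x, var_x = np.array([1.0, 2.0]), np.array([0.05, 0.05])
print(linear_unit_moments(mu_w, var_w, mu_x, var_x))
```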
arXiv Detail & Related papers (2020-04-20T13:37:08Z)
- Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
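A sketch of orthogonalization by Newton's iteration (the Newton-Schulz scheme): after scaling the matrix so its singular values lie in the convergence region, the iteration Y <- Y (3I - Y^T Y) / 2 drives all singular values toward 1, and truncating it early gives a controllable degree of orthogonality. The Frobenius-norm scaling and the iteration counts below are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def newton_schulz_orthogonalize(W, iters):
    Y = W / np.linalg.norm(W, ord='fro')       # scale so the iteration converges
    I = np.eye(W.shape[1])
    for _ in range(iters):
        Y = Y @ (3.0 * I - Y.T @ Y) / 2.0      # Newton's iteration step
    return Y

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
for t in (2, 6, 12):                           # more iterations = closer to orthogonal
    Q = newton_schulz_orthogonalize(W, iters=t)
    print(f"iters={t}: ||Q^T Q - I||_F = {np.linalg.norm(Q.T @ Q - np.eye(32)):.4f}")
```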
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
- On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the Gauss-Newton component G.
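For reference, a toy illustration of the decomposition that statement refers to: for squared loss, the Hessian splits as H = G + R, where G is the Gauss-Newton term built from per-example gradients and R weights the per-example Hessians by the residuals, so R fades as the fit improves. The two-parameter tanh model below is an assumption chosen only to make the split explicit; it does not reproduce the paper's width-dependence experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 0.8 * np.tanh(1.5 * x)                 # data generated by the same family

def decompose(w1, w2):
    # Model f(x) = w2 * tanh(w1 * x), squared loss; returns (G, residual term R).
    z = w1 * x
    t, s2 = np.tanh(z), 1.0 / np.cosh(z) ** 2
    f = w2 * t
    J = np.stack([w2 * x * s2, t], axis=1)             # per-example gradient of f
    G = J.T @ J / len(x)
    R = np.array([[np.mean((f - y) * w2 * x**2 * (-2 * s2 * t)),
                   np.mean((f - y) * x * s2)],
                  [np.mean((f - y) * x * s2), 0.0]])
    return G, R

for w1, w2 in [(0.2, 0.2), (1.5, 0.8)]:                # far from vs at the generating fit
    G, R = decompose(w1, w2)
    print(f"w=({w1},{w2}): ||G||={np.linalg.norm(G):.3f}, "
          f"||residual term||={np.linalg.norm(R):.3f}")
```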
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.