Infinite attention: NNGP and NTK for deep attention networks
- URL: http://arxiv.org/abs/2006.10540v1
- Date: Thu, 18 Jun 2020 13:57:01 GMT
- Title: Infinite attention: NNGP and NTK for deep attention networks
- Authors: Jiri Hron and Yasaman Bahri and Jascha Sohl-Dickstein and Roman Novak
- Abstract summary: We rigorously extend the known equivalence between wide neural networks (NNs) and Gaussian processes (GPs) to architectures with attention layers.
We show that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity.
We introduce new features to the Neural Tangents library allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences.
- Score: 38.55012122588628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a growing amount of literature on the relationship between wide
neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence
between the two for a variety of NN architectures. This equivalence enables,
for instance, accurate approximation of the behaviour of wide Bayesian NNs
without MCMC or variational approximations, or characterisation of the
distribution of randomly initialised wide NNs optimised by gradient descent
without ever running an optimiser. We provide a rigorous extension of these
results to NNs involving attention layers, showing that unlike single-head
attention, which induces non-Gaussian behaviour, multi-head attention
architectures behave as GPs as the number of heads tends to infinity. We
further discuss the effects of positional encodings and layer normalisation,
and propose modifications of the attention mechanism which lead to improved
results for both finite and infinitely wide NNs. We evaluate attention kernels
empirically, leading to a moderate improvement upon the previous
state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced
data preprocessing. Finally, we introduce new features to the Neural Tangents
library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and
without attention, to variable-length sequences, with an example on the IMDb
reviews dataset.
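
As a concrete illustration of the workflow the abstract describes, the sketch below uses the Neural Tangents stax API to assemble an infinite-width network with a multi-head self-attention layer and to read off its NNGP and NTK predictions in closed form, without running an optimiser or MCMC. The stax.Dense, stax.Relu, stax.GlobalAvgPool and nt.predict.gradient_descent_mse_ensemble calls follow the library's standard usage; the GlobalSelfAttention keyword arguments, layer widths, and toy data shapes are illustrative assumptions rather than settings taken from the paper.

```python
import jax
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

# Infinite-width model over sequences: embedding -> multi-head self-attention
# -> pooling over the sequence axis -> scalar readout. In the infinite-width,
# infinite-head limit this network behaves as a GP with the kernels below.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(256),
    stax.Relu(),
    # Keyword names below are assumptions; check the installed version's signature.
    stax.GlobalSelfAttention(
        n_chan_out=256, n_chan_key=256, n_chan_val=256, n_heads=8),
    stax.GlobalAvgPool(),
    stax.Dense(1),
)

# Toy data shaped (batch, sequence length, features). Real variable-length inputs
# would be padded and flagged to kernel_fn via the library's masking support.
key = jax.random.PRNGKey(0)
x_train = jax.random.normal(key, (8, 16, 32))
y_train = jnp.sign(jax.random.normal(key, (8, 1)))
x_test = jax.random.normal(jax.random.PRNGKey(1), (4, 16, 32))

# Closed-form mean predictions of the infinite ensemble of wide NNs: the NNGP
# posterior mean (exact Bayesian inference) and the NTK predictor (gradient
# descent trained to convergence on MSE loss), with no optimiser ever run.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
y_nngp, y_ntk = predict_fn(x_test=x_test, get=('nngp', 'ntk'))
```

For genuinely variable-length sequences, the paper's library additions let the kernel ignore padded positions via a mask constant supplied to kernel_fn; the exact argument name may depend on the library version, so it is omitted from the sketch above.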
Related papers
- Graph Neural Networks Do Not Always Oversmooth [46.57665708260211]
We study oversmoothing in graph convolutional networks (GCNs) by using their Gaussian process (GP) equivalence in the limit of infinitely many hidden features.
We identify a new, non-oversmoothing phase: if the initial weights of the network have sufficiently large variance, GCNs do not oversmooth, and node features remain informative even at large depth.
arXiv Detail & Related papers (2024-06-04T12:47:13Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes [7.4468224549568705]
Gaussian processes (GPs) are often compared unfavorably to deep neural networks (NNs) for lacking the ability to learn representations.
Recent efforts to bridge the gap between GPs and deep NNs have yielded a new class of inter-domain variational GPs in which the inducing variables correspond to hidden units of a feedforward NN.
arXiv Detail & Related papers (2023-04-27T09:00:02Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a Polynomial Net Study [55.12108376616355]
The study of the NTK has so far been devoted to typical neural network architectures and remains incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Deep Stable neural networks: large-width asymptotics and convergence rates [3.0108936184913295]
We show that, as the width goes to infinity jointly over the NN's layers, a suitably rescaled deep Stable NN converges weakly to a Stable stochastic process (SP).
Because of the NN's non-triangular structure, this is a non-standard problem, for which we propose a novel and self-contained inductive approach.
arXiv Detail & Related papers (2021-08-02T12:18:00Z)
- Weighted Neural Tangent Kernel: A Generalized and Improved Network-Induced Kernel [20.84988773171639]
The Neural Tangent Kernel (NTK) has recently attracted intense study, as it describes the evolution of an over-parameterized Neural Network (NN) trained by gradient descent.
We introduce the Weighted Neural Tangent Kernel (WNTK), a generalized and improved tool, which can capture an over-parameterized NN's training dynamics under different gradients.
With the proposed weight update algorithm, both empirical and analytical WNTKs outperform the corresponding NTKs in numerical experiments.
arXiv Detail & Related papers (2021-03-22T03:16:20Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.