On Learning the Transformer Kernel
- URL: http://arxiv.org/abs/2110.08323v1
- Date: Fri, 15 Oct 2021 19:20:25 GMT
- Title: On Learning the Transformer Kernel
- Authors: Sankalan Pal Chowdhury, Adamos Solomou, Avinava Dubey and Mrinmaya
Sachan
- Abstract summary: KERNELIZED TRANSFORMER is a generic, scalable, data-driven framework for learning the kernel function in Transformers.
Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution.
- Score: 13.955526058823166
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable,
data-driven framework for learning the kernel function in Transformers. Our
framework approximates the Transformer kernel as a dot product between spectral
feature maps and learns the kernel by learning the spectral distribution. This
not only helps in learning a generic kernel end-to-end, but also reduces the
time and space complexity of Transformers from quadratic to linear. We show
that KERNELIZED TRANSFORMERS achieve performance comparable to existing
efficient Transformer architectures, in terms of both accuracy and
computational efficiency. Our study also demonstrates that the choice of the
kernel has a substantial impact on performance, and that kernel learning
variants are competitive alternatives to fixed-kernel Transformers on both
long- and short-sequence tasks.
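To make the mechanism concrete, the following is a minimal, hypothetical sketch (in PyTorch) of linear attention with a learnable spectral distribution. The module name, the choice of a Gaussian spectral family, and the positive-random-feature map are illustrative assumptions rather than the paper's implementation; the sketch only shows how approximating the attention kernel as a dot product of spectral feature maps gives linear time and memory while keeping the spectral parameters trainable.

```python
import torch
import torch.nn as nn

class LearnedSpectralAttention(nn.Module):
    """Illustrative linear attention whose kernel is learned via a spectral distribution."""

    def __init__(self, dim: int, num_features: int = 64):
        super().__init__()
        # Parameters of an assumed Gaussian spectral distribution over frequencies.
        self.mu = nn.Parameter(torch.zeros(num_features, dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_features, dim))
        self.num_features = num_features

    def feature_map(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterized sampling keeps the spectral parameters trainable end-to-end.
        omega = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)   # (m, d)
        proj = x @ omega.t()                                                  # (..., n, m)
        # Positive random features approximating an exponential (softmax-like) kernel.
        return torch.exp(proj - 0.5 * x.pow(2).sum(-1, keepdim=True)) / self.num_features ** 0.5

    def forward(self, q, k, v):
        # q, k, v: (batch, n, dim); cost is O(n * m * d) rather than O(n^2 * d).
        q_f, k_f = self.feature_map(q), self.feature_map(k)                   # (b, n, m)
        kv = torch.einsum('bnm,bnd->bmd', k_f, v)                             # summarize keys/values once
        normalizer = q_f @ k_f.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6
        return torch.einsum('bnm,bmd->bnd', q_f, kv) / normalizer

# Example usage on random tensors: no n x n attention matrix is ever formed.
attn = LearnedSpectralAttention(dim=32)
q = k = v = torch.randn(2, 128, 32)
out = attn(q, k, v)   # (2, 128, 32)
```

In the paper's framework it is the spectral distribution itself that is learned; the Gaussian parameterization above is just one possible choice to make the idea runnable.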
Related papers
- Amortized Inference for Gaussian Process Hyperparameters of Structured
Kernels [5.1672267755831705]
Amortizing parameter inference over different datasets is a promising approach to dramatically reducing training time.
We propose amortizing kernel parameter inference over a complete kernel-structure-family rather than a fixed kernel structure.
We show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets.
arXiv Detail & Related papers (2023-06-16T13:02:57Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z)
- Transformer with Fourier Integral Attentions [18.031977028559282]
We propose a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels.
Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads.
We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
arXiv Detail & Related papers (2022-06-01T03:06:21Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Kernel Continual Learning [117.79080100313722]
Kernel continual learning is a simple but effective variant of continual learning that tackles catastrophic forgetting.
An episodic memory unit stores a subset of samples for each task, which are used to learn task-specific classifiers based on kernel ridge regression (sketched below).
Variational random features are used to learn a data-driven kernel for each task.
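A rough, generic sketch of the kernel ridge regression step described above (NumPy; the RBF kernel, regularization value, and function names are assumptions, not the paper's code): a task-specific classifier is fit on the samples held in that task's episodic memory and then queried at test time.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2); any positive-definite kernel would do.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_task_classifier(memory_x, memory_y, lam=1e-2, gamma=1.0):
    # Kernel ridge regression on the task's episodic memory:
    # solve (K + lam * I) alpha = Y for the dual coefficients alpha.
    K = rbf_kernel(memory_x, memory_x, gamma)
    return np.linalg.solve(K + lam * np.eye(len(memory_x)), memory_y)

def predict(x, memory_x, alpha, gamma=1.0):
    # Predictions are kernel-weighted combinations of the stored targets.
    return rbf_kernel(x, memory_x, gamma) @ alpha

# Example: 20 stored samples for one task, 5 features, one-hot targets for 3 classes.
mem_x = np.random.randn(20, 5)
mem_y = np.eye(3)[np.random.randint(0, 3, size=20)]
alpha = fit_task_classifier(mem_x, mem_y)
scores = predict(np.random.randn(4, 5), mem_x, alpha)   # (4, 3) class scores
```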
arXiv Detail & Related papers (2021-07-12T22:09:30Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); the underlying Toeplitz trick is sketched below.
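The computational trick behind this can be illustrated with a short, generic sketch (NumPy; the function name and the zero entry used in the circulant embedding are standard but assumed here): a Toeplitz matrix, such as one induced by relative positional encodings, can be applied to a vector in O(n log n) by embedding it in a circulant matrix and multiplying in the Fourier domain, rather than materializing the full n x n matrix.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, v):
    """Compute T @ v for a Toeplitz matrix T given by its first column and first row."""
    n = len(v)
    assert first_col[0] == first_row[0]
    # Embed T in a 2n x 2n circulant matrix whose first column is
    # [first_col, 0, first_row[n-1], ..., first_row[1]].
    c = np.concatenate([first_col, [0.0], first_row[-1:0:-1]])
    # A circulant matvec is a circular convolution, i.e. an elementwise
    # product in the Fourier domain.
    out = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([v, np.zeros(n)])))
    return out[:n].real

# Check against an explicit O(n^2) construction of T.
n = 6
col, row = np.random.randn(n), np.random.randn(n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
v = np.random.randn(n)
assert np.allclose(T @ v, toeplitz_matvec(col, row, v))
```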
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines [15.55404574021651]
We show that the "dot-product attention" that is the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces.
In particular, the Transformer's kernel is characterized as having an infinite feature dimension.
This paper's results provide a new theoretical understanding of a very important but poorly understood model in modern machine learning.
arXiv Detail & Related papers (2021-06-02T23:24:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.