On Learning the Transformer Kernel
- URL: http://arxiv.org/abs/2110.08323v1
- Date: Fri, 15 Oct 2021 19:20:25 GMT
- Title: On Learning the Transformer Kernel
- Authors: Sankalan Pal Chowdhury, Adamos Solomou, Avinava Dubey and Mrinmaya
Sachan
- Abstract summary: KERNELIZED TRANSFORMER is a generic, scalable, data-driven framework for learning the kernel function in Transformers.
Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution.
- Score: 13.955526058823166
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable,
data-driven framework for learning the kernel function in Transformers. Our
framework approximates the Transformer kernel as a dot product between spectral
feature maps and learns the kernel by learning the spectral distribution. This
not only helps in learning a generic kernel end-to-end, but also reduces the
time and space complexity of Transformers from quadratic to linear. We show
that KERNELIZED TRANSFORMERS achieve performance comparable to existing
efficient Transformer architectures, in terms of both accuracy and
computational efficiency. Our study also demonstrates that the choice of the
kernel has a substantial impact on performance, and that kernel learning
variants are competitive alternatives to fixed-kernel Transformers on both
long- and short-sequence tasks.
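To make the mechanism concrete, the following is a minimal, hypothetical sketch (in PyTorch) of linear attention with a learnable spectral distribution. The module name, the choice of a Gaussian spectral family, and the positive-random-feature map are illustrative assumptions rather than the paper's implementation; the sketch only shows how approximating the attention kernel as a dot product of spectral feature maps gives linear time and memory while keeping the spectral parameters trainable.

```python
import torch
import torch.nn as nn

class LearnedSpectralAttention(nn.Module):
    """Illustrative linear attention whose kernel is learned via a spectral distribution."""

    def __init__(self, dim: int, num_features: int = 64):
        super().__init__()
        # Parameters of an assumed Gaussian spectral distribution over frequencies.
        self.mu = nn.Parameter(torch.zeros(num_features, dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_features, dim))
        self.num_features = num_features

    def feature_map(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterized sampling keeps the spectral parameters trainable end-to-end.
        omega = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)   # (m, d)
        proj = x @ omega.t()                                                  # (..., n, m)
        # Positive random features approximating an exponential (softmax-like) kernel.
        return torch.exp(proj - 0.5 * x.pow(2).sum(-1, keepdim=True)) / self.num_features ** 0.5

    def forward(self, q, k, v):
        # q, k, v: (batch, n, dim); cost is O(n * m * d) rather than O(n^2 * d).
        q_f, k_f = self.feature_map(q), self.feature_map(k)                   # (b, n, m)
        kv = torch.einsum('bnm,bnd->bmd', k_f, v)                             # summarize keys/values once
        normalizer = q_f @ k_f.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6
        return torch.einsum('bnm,bmd->bnd', q_f, kv) / normalizer

# Example usage on random tensors: no n x n attention matrix is ever formed.
attn = LearnedSpectralAttention(dim=32)
q = k = v = torch.randn(2, 128, 32)
out = attn(q, k, v)   # (2, 128, 32)
```

In the paper's framework it is the spectral distribution itself that is learned; the Gaussian parameterization above is just one possible choice to make the idea runnable.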
Related papers
- Amortized Inference for Gaussian Process Hyperparameters of Structured
Kernels [5.1672267755831705]
Amortizing parameter inference over different datasets is a promising approach to dramatically reducing training time.
We propose amortizing kernel parameter inference over a complete kernel-structure-family rather than a fixed kernel structure.
We show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets.
arXiv Detail & Related papers (2023-06-16T13:02:57Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z)
- Transformer with Fourier Integral Attentions [18.031977028559282]
We propose a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels.
Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads.
We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
arXiv Detail & Related papers (2022-06-01T03:06:21Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Kernel Continual Learning [117.79080100313722]
Kernel continual learning is a simple but effective variant of continual learning that tackles catastrophic forgetting.
An episodic memory unit stores a subset of samples for each task, which are used to learn task-specific classifiers based on kernel ridge regression (sketched below).
Variational random features are used to learn a data-driven kernel for each task.
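A rough, generic sketch of the kernel ridge regression step described above (NumPy; the RBF kernel, regularization value, and function names are assumptions, not the paper's code): a task-specific classifier is fit on the samples held in that task's episodic memory and then queried at test time.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2); any positive-definite kernel would do.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_task_classifier(memory_x, memory_y, lam=1e-2, gamma=1.0):
    # Kernel ridge regression on the task's episodic memory:
    # solve (K + lam * I) alpha = Y for the dual coefficients alpha.
    K = rbf_kernel(memory_x, memory_x, gamma)
    return np.linalg.solve(K + lam * np.eye(len(memory_x)), memory_y)

def predict(x, memory_x, alpha, gamma=1.0):
    # Predictions are kernel-weighted combinations of the stored targets.
    return rbf_kernel(x, memory_x, gamma) @ alpha

# Example: 20 stored samples for one task, 5 features, one-hot targets for 3 classes.
mem_x = np.random.randn(20, 5)
mem_y = np.eye(3)[np.random.randint(0, 3, size=20)]
alpha = fit_task_classifier(mem_x, mem_y)
scores = predict(np.random.randn(4, 5), mem_x, alpha)   # (4, 3) class scores
```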
arXiv Detail & Related papers (2021-07-12T22:09:30Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); the underlying Toeplitz trick is sketched below.
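The computational trick behind this can be illustrated with a short, generic sketch (NumPy; the function name and the zero entry used in the circulant embedding are standard but assumed here): a Toeplitz matrix, such as one induced by relative positional encodings, can be applied to a vector in O(n log n) by embedding it in a circulant matrix and multiplying in the Fourier domain, rather than materializing the full n x n matrix.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, v):
    """Compute T @ v for a Toeplitz matrix T given by its first column and first row."""
    n = len(v)
    assert first_col[0] == first_row[0]
    # Embed T in a 2n x 2n circulant matrix whose first column is
    # [first_col, 0, first_row[n-1], ..., first_row[1]].
    c = np.concatenate([first_col, [0.0], first_row[-1:0:-1]])
    # A circulant matvec is a circular convolution, i.e. an elementwise
    # product in the Fourier domain.
    out = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([v, np.zeros(n)])))
    return out[:n].real

# Check against an explicit O(n^2) construction of T.
n = 6
col, row = np.random.randn(n), np.random.randn(n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
v = np.random.randn(n)
assert np.allclose(T @ v, toeplitz_matvec(col, row, v))
```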
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
- Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines [15.55404574021651]
We show that the "dot-product attention" that is the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces.
In particular, the Transformer's kernel is characterized as having an infinite feature dimension.
This paper's results provide a new theoretical understanding of a very important but poorly understood model in modern machine learning.
arXiv Detail & Related papers (2021-06-02T23:24:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.