Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines
- URL: http://arxiv.org/abs/2106.01506v1
- Date: Wed, 2 Jun 2021 23:24:06 GMT
- Title: Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines
- Authors: Matthew A. Wright, Joseph E. Gonzalez
- Abstract summary: We show that the "dot-product attention" that is the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces.
In particular, the Transformer's kernel is characterized as having an infinite feature dimension.
This paper's results provide a new theoretical understanding of a very important but poorly understood model in modern machine learning.
- Score: 15.55404574021651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their ubiquity in core AI fields like natural language processing,
the mechanics of deep attention-based neural networks like the Transformer
model are not fully understood. In this article, we present a new perspective
towards understanding how Transformers work. In particular, we show that the
"dot-product attention" that is the core of the Transformer's operation can be
characterized as a kernel learning method on a pair of Banach spaces. In
particular, the Transformer's kernel is characterized as having an infinite
feature dimension. Along the way we consider an extension of the standard
kernel learning problem to a binary setting, where data come from two input
domains and a response is defined for every cross-domain pair. We prove a new
representer theorem for these binary kernel machines with non-Mercer
(indefinite, asymmetric) kernels (implying that the functions learned are
elements of reproducing kernel Banach spaces rather than Hilbert spaces), and
also prove a new universal approximation theorem showing that the Transformer
calculation can learn any binary non-Mercer reproducing kernel Banach space
pair. We experiment with new kernels in Transformers, and obtain results that
suggest the infinite dimensionality of the standard Transformer kernel is
partially responsible for its performance. This paper's results provide a new
theoretical understanding of a very important but poorly understood model in
modern machine learning.
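To make the abstract's central claim concrete, the following is a minimal illustrative sketch (not code from the paper): standard dot-product attention written as a kernel machine, where the exponentiated scaled dot product plays the role of an asymmetric, indefinite (non-Mercer) kernel on query/key pairs and the softmax is simply a row-wise normalization of kernel evaluations.

```python
import numpy as np

# Minimal sketch (not the authors' code): dot-product attention viewed as a
# kernel machine. Queries and keys come from two different input domains; the
# exponentiated scaled dot product acts as an asymmetric, indefinite kernel,
# and the softmax is a row-wise normalization of the kernel evaluations.
def attention_as_kernel_machine(Q, K, V):
    d = Q.shape[-1]
    kappa = np.exp(Q @ K.T / np.sqrt(d))                  # k(q_i, k_j) for every query/key pair
    weights = kappa / kappa.sum(axis=-1, keepdims=True)   # softmax = normalized kernel weights
    return weights @ V                                     # kernel-weighted combination of values

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
out = attention_as_kernel_machine(Q, K, V)                 # shape (4, 8)
```

Written this way, each query's output is a value-weighted sum of kernel evaluations against the keys, which is essentially the kernel-expansion form addressed by the paper's binary representer theorem.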
Related papers
- Spectral Truncation Kernels: Noncommutativity in $C^*$-algebraic Kernel Machines [12.11705128358537]
We propose a new class of positive definite kernels based on the spectral truncation.
We show that the truncation parameter $n$ is a governing factor leading to performance enhancement.
arXiv Detail & Related papers (2024-05-28T04:47:12Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z) - Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as hybrid networks that combine CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z) - The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits.
This provides insight into the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z) - Transformer with Fourier Integral Attentions [18.031977028559282]
We propose a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels.
Compared to conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads.
We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.
arXiv Detail & Related papers (2022-06-01T03:06:21Z) - On Learning the Transformer Kernel [13.955526058823166]
KERNELIZED TRANSFORMER is a generic, scalable, data-driven framework for learning the kernel function in Transformers.
Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution (see the sketch after this list).
arXiv Detail & Related papers (2021-10-15T19:20:25Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z)
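The "On Learning the Transformer Kernel" entry above describes approximating the attention kernel as a dot product between spectral feature maps. The snippet below is a generic random-Fourier-feature sketch of that idea, not the KERNELIZED TRANSFORMER implementation; the module name and the learnable per-dimension scale are illustrative assumptions standing in for a learned spectral distribution.

```python
import torch

class RandomSpectralFeatures(torch.nn.Module):
    """Hypothetical sketch: phi(q) . phi(k) approximates a shift-invariant kernel,
    and the kernel is adapted by learning (here) a per-dimension spectral scale."""
    def __init__(self, dim, num_features=64):
        super().__init__()
        self.log_scale = torch.nn.Parameter(torch.zeros(dim))            # learnable spectral scale
        self.register_buffer("omega", torch.randn(dim, num_features))    # fixed base frequencies

    def forward(self, x):  # x: (..., dim)
        proj = x @ (self.omega * self.log_scale.exp().unsqueeze(-1))
        # cos/sin features give phi(q) . phi(k) ~= k(q - k) in expectation
        return torch.cat([proj.cos(), proj.sin()], dim=-1) / (self.omega.shape[1] ** 0.5)

phi = RandomSpectralFeatures(dim=16)
q, k = torch.randn(8, 16), torch.randn(8, 16)
approx_gram = phi(q) @ phi(k).T   # (8, 8) approximate kernel (attention score) matrix
```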