Translational Equivariance in Kernelizable Attention
- URL: http://arxiv.org/abs/2102.07680v1
- Date: Mon, 15 Feb 2021 17:14:15 GMT
- Title: Translational Equivariance in Kernelizable Attention
- Authors: Max Horn, Kumar Shridhar, Elrich Groenewald, Philipp F. M. Baumann
- Abstract summary: We show how translational equivariance can be implemented in efficient Transformers based on kernelizable attention.
Our experiments highlight that the devised approach significantly improves robustness of Performers to shifts of input images.
- Score: 3.236198583140341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformer architectures have shown remarkable success, they are bound
to the computation of all pairwise interactions of input elements and thus
suffer from limited scalability. Recent work has been successful in avoiding
the computation of the complete attention matrix, yet this leads to problems
down the line. The absence of an explicit attention matrix makes the inclusion of
inductive biases relying on relative interactions between elements more
challenging. An extremely powerful inductive bias is translational
equivariance, which has been conjectured to be responsible for much of the
success of Convolutional Neural Networks on image recognition tasks. In this
work we show how translational equivariance can be implemented in efficient
Transformers based on kernelizable attention - Performers. Our experiments
highlight that the devised approach significantly improves robustness of
Performers to shifts of input images compared to their naive application. This
represents an important step on the path of replacing Convolutional Neural
Networks with more expressive Transformer architectures and will help to
improve sample efficiency and robustness in this realm.
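To make the mechanism concrete, here is a minimal NumPy sketch of kernelizable (linear) attention as used in Performers: a positive random-feature map approximates the softmax kernel, so attention is computed in O(n) without ever materializing the n x n attention matrix. The function names and the particular feature map are illustrative assumptions; the paper's translational-equivariance construction on top of this base is not reproduced here.

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    """Positive random features approximating the softmax kernel (FAVOR+ style)."""
    d = x.shape[-1]
    x = x / d**0.25                       # scaling as in scaled dot-product attention
    proj = x @ projection.T               # (n, m) random projections
    sq_norm = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    # exp(w.x - |x|^2/2) yields positive, unbiased features for exp(q.k)
    return np.exp(proj - sq_norm) / np.sqrt(projection.shape[0]) + eps

def linear_attention(Q, K, V, num_features=128, seed=0):
    """O(n) attention: the n x n attention matrix is never materialized."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))   # illustrative random features
    Qf, Kf = softmax_kernel_features(Q, W), softmax_kernel_features(K, W)
    KV = Kf.T @ V                         # (m, d_v): summarize keys and values once
    normalizer = Qf @ Kf.sum(axis=0)      # (n,): row-wise softmax denominator
    return (Qf @ KV) / normalizer[:, None]

n, d = 64, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)    # (64, 32)
```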
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic processing methods from computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
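A hedged sketch of this idea: treat the n x n attention-score matrix as a single-channel image and smooth it with a small convolution before the softmax. The fixed averaging kernel and shapes below are illustrative assumptions, not the DAPE V2 implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(scores, kernel):
    """Naive 'same'-padded 2D convolution (cross-correlation) over a score map."""
    k = kernel.shape[0]
    padded = np.pad(scores, k // 2, mode="edge")
    windows = sliding_window_view(padded, (k, k))    # (n, n, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def conv_processed_attention(Q, K, V, kernel):
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])        # raw attention scores
    scores = conv2d_same(scores, kernel)             # process scores as a feature map
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
smoothing = np.full((3, 3), 1 / 9.0)                 # a fixed averaging kernel
print(conv_processed_attention(Q, K, V, smoothing).shape)  # (16, 8)
```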
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Representational Strengths and Limitations of Transformers [33.659870765923884]
We establish both positive and negative results on the representation power of attention layers.
We show the necessity and role of a large embedding dimension in a transformer.
We also present natural variants that can be efficiently solved by attention layers.
arXiv Detail & Related papers (2023-06-05T14:05:04Z)
- Empowering Networks With Scale and Rotation Equivariance Using A Similarity Convolution [16.853711292804476]
We devise a method that endows CNNs with simultaneous equivariance with respect to translation, rotation, and scaling.
Our approach defines a convolution-like operation and ensures equivariance based on our proposed scalable Fourier-Argand representation.
We validate the efficacy of our approach in the image classification task, demonstrating its robustness and the generalization ability to both scaled and rotated inputs.
arXiv Detail & Related papers (2023-03-01T08:43:05Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
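HEAT's hardware-aware search is beyond a short snippet, but the compression primitive underneath, replacing a dense weight matrix with two thin factors, can be sketched generically via truncated SVD. This is a standard low-rank factorization, not the HEAT algorithm itself.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace a dense layer weight W (d_out x d_in) by two thin factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank)
    B = Vt[:rank, :]             # (rank, d_in); W ~= A @ B with fewer params/FLOPs
    return A, B

d_out, d_in, rank = 256, 256, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
A, B = low_rank_factorize(W, rank)
x = rng.standard_normal(d_in)
# the factorized layer computes A @ (B @ x) instead of W @ x
print(np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x))
```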
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
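A minimal sketch of Fourier-domain token mixing in this spirit: FFT along the sequence axis, an element-wise complex weighting per frequency, then an inverse FFT, giving O(n log n) mixing. The real AFNO learns block-diagonal per-frequency MLPs with soft thresholding; the random weights here are only a stand-in.

```python
import numpy as np

def fourier_token_mixer(x, freq_weights):
    """Mix tokens by pointwise multiplication in the Fourier domain.

    x: (seq_len, channels) real-valued token embeddings.
    freq_weights: (seq_len // 2 + 1, channels) complex per-frequency weights.
    """
    Xf = np.fft.rfft(x, axis=0)            # FFT along the sequence axis
    Xf = Xf * freq_weights                 # per-frequency, per-channel mixing
    return np.fft.irfft(Xf, n=x.shape[0], axis=0)

seq_len, channels = 65536, 8               # 65k-token sequences are cheap: O(n log n)
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, channels))
w = (rng.standard_normal((seq_len // 2 + 1, channels))
     + 1j * rng.standard_normal((seq_len // 2 + 1, channels)))
print(fourier_token_mixer(x, w).shape)     # (65536, 8)
```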
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- CETransformer: Casual Effect Estimation via Transformer Based Representation Learning [17.622007687796756]
Data-driven causal effect estimation faces two main challenges: selection bias and missing counterfactuals.
To address these two issues, most of the existing approaches tend to reduce the selection bias by learning a balanced representation.
We propose the CETransformer model for causal effect estimation via transformer-based representation learning.
arXiv Detail & Related papers (2021-07-19T09:39:57Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures across a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
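One plausible reading of such a scheme, sketched below: exponentiated scores are first normalized over queries, so every key distributes a unit of attention and cannot be fully "explained away", and then over keys, so each row is again a distribution. The exact normalization order is an assumption, not taken from the paper.

```python
import numpy as np

def doubly_normalized_attention(Q, K, V):
    """Two-step normalization of attention weights (illustrative sketch)."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max())        # stabilized exponentiated scores
    e = e / e.sum(axis=0, keepdims=True)     # normalize over queries, per key
    w = e / e.sum(axis=1, keepdims=True)     # normalize over keys, per query
    return w @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(doubly_normalized_attention(Q, K, V).shape)  # (16, 8)
```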
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those computed by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
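The verification algorithm itself is involved, but the naive Interval Bound Propagation baseline it is compared against is easy to sketch: propagate element-wise lower and upper bounds through a linear layer and a ReLU. This is generic IBP under an L-infinity perturbation, not the paper's method.

```python
import numpy as np

def ibp_linear(lo, hi, W, b):
    """Interval Bound Propagation through y = W x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius                # worst case over the input box
    return c - r, c + r

def ibp_relu(lo, hi):
    """ReLU is monotone, so bounds pass through element-wise."""
    return np.maximum(lo, 0), np.maximum(hi, 0)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
x = rng.standard_normal(3)
eps = 0.1                                 # L-infinity perturbation budget
lo, hi = ibp_relu(*ibp_linear(x - eps, x + eps, W, b))
print(lo <= hi)                           # bounds stay ordered: all True
```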
arXiv Detail & Related papers (2020-02-16T17:16:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.