Translational Equivariance in Kernelizable Attention
- URL: http://arxiv.org/abs/2102.07680v1
- Date: Mon, 15 Feb 2021 17:14:15 GMT
- Title: Translational Equivariance in Kernelizable Attention
- Authors: Max Horn, Kumar Shridhar, Elrich Groenewald, Philipp F. M. Baumann
- Abstract summary: We show how translational equivariance can be implemented in efficient Transformers based on kernelizable attention.
Our experiments highlight that the devised approach significantly improves robustness of Performers to shifts of input images.
- Score: 3.236198583140341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformer architectures have shown remarkable success, they are bound
to the computation of all pairwise interactions of input elements and thus
suffer from limited scalability. Recent work has been successful in avoiding
the computation of the complete attention matrix, yet this leads to problems
down the line. The absence of an explicit attention matrix makes the inclusion of
inductive biases relying on relative interactions between elements more
challenging. An extremely powerful inductive bias is translational
equivariance, which has been conjectured to be responsible for much of the
success of Convolutional Neural Networks on image recognition tasks. In this
work we show how translational equivariance can be implemented in efficient
Transformers based on kernelizable attention - Performers. Our experiments
highlight that the devised approach significantly improves robustness of
Performers to shifts of input images compared to their naive application. This
represents an important step on the path of replacing Convolutional Neural
Networks with more expressive Transformer architectures and will help to
improve sample efficiency and robustness in this realm.
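To make the mechanism concrete, here is a minimal NumPy sketch of kernelizable (linear) attention as used in Performers: a positive random-feature map approximates the softmax kernel, so attention is computed in O(n) without ever materializing the n x n attention matrix. The function names and the particular feature map are illustrative assumptions; the paper's translational-equivariance construction on top of this base is not reproduced here.

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    """Positive random features approximating the softmax kernel (FAVOR+ style)."""
    d = x.shape[-1]
    x = x / d**0.25                       # scaling as in scaled dot-product attention
    proj = x @ projection.T               # (n, m) random projections
    sq_norm = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    # exp(w.x - |x|^2/2) yields positive, unbiased features for exp(q.k)
    return np.exp(proj - sq_norm) / np.sqrt(projection.shape[0]) + eps

def linear_attention(Q, K, V, num_features=128, seed=0):
    """O(n) attention: the n x n attention matrix is never materialized."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))   # illustrative random features
    Qf, Kf = softmax_kernel_features(Q, W), softmax_kernel_features(K, W)
    KV = Kf.T @ V                         # (m, d_v): summarize keys and values once
    normalizer = Qf @ Kf.sum(axis=0)      # (n,): row-wise softmax denominator
    return (Qf @ KV) / normalizer[:, None]

n, d = 64, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)    # (64, 32)
```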
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic processing methods from computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
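A hedged sketch of this idea: treat the n x n attention-score matrix as a single-channel image and smooth it with a small convolution before the softmax. The fixed averaging kernel and shapes below are illustrative assumptions, not the DAPE V2 implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(scores, kernel):
    """Naive 'same'-padded 2D convolution (cross-correlation) over a score map."""
    k = kernel.shape[0]
    padded = np.pad(scores, k // 2, mode="edge")
    windows = sliding_window_view(padded, (k, k))    # (n, n, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def conv_processed_attention(Q, K, V, kernel):
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])        # raw attention scores
    scores = conv2d_same(scores, kernel)             # process scores as a feature map
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
smoothing = np.full((3, 3), 1 / 9.0)                 # a fixed averaging kernel
print(conv_processed_attention(Q, K, V, smoothing).shape)  # (16, 8)
```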
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Representational Strengths and Limitations of Transformers [33.659870765923884]
We establish both positive and negative results on the representation power of attention layers.
We show the necessity and role of a large embedding dimension in a transformer.
We also present natural variants that can be efficiently solved by attention layers.
arXiv Detail & Related papers (2023-06-05T14:05:04Z)
- Empowering Networks With Scale and Rotation Equivariance Using A Similarity Convolution [16.853711292804476]
We devise a method that endows CNNs with simultaneous equivariance with respect to translation, rotation, and scaling.
Our approach defines a convolution-like operation and ensures equivariance based on our proposed scalable Fourier-Argand representation.
We validate the efficacy of our approach in the image classification task, demonstrating its robustness and the generalization ability to both scaled and rotated inputs.
arXiv Detail & Related papers (2023-03-01T08:43:05Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
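HEAT's hardware-aware search is beyond a short snippet, but the compression primitive underneath, replacing a dense weight matrix with two thin factors, can be sketched generically via truncated SVD. This is a standard low-rank factorization, not the HEAT algorithm itself.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace a dense layer weight W (d_out x d_in) by two thin factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank)
    B = Vt[:rank, :]             # (rank, d_in); W ~= A @ B with fewer params/FLOPs
    return A, B

d_out, d_in, rank = 256, 256, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
A, B = low_rank_factorize(W, rank)
x = rng.standard_normal(d_in)
# the factorized layer computes A @ (B @ x) instead of W @ x
print(np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x))
```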
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
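A minimal sketch of Fourier-domain token mixing in this spirit: FFT along the sequence axis, an element-wise complex weighting per frequency, then an inverse FFT, giving O(n log n) mixing. The real AFNO learns block-diagonal per-frequency MLPs with soft thresholding; the random weights here are only a stand-in.

```python
import numpy as np

def fourier_token_mixer(x, freq_weights):
    """Mix tokens by pointwise multiplication in the Fourier domain.

    x: (seq_len, channels) real-valued token embeddings.
    freq_weights: (seq_len // 2 + 1, channels) complex per-frequency weights.
    """
    Xf = np.fft.rfft(x, axis=0)            # FFT along the sequence axis
    Xf = Xf * freq_weights                 # per-frequency, per-channel mixing
    return np.fft.irfft(Xf, n=x.shape[0], axis=0)

seq_len, channels = 65536, 8               # 65k-token sequences are cheap: O(n log n)
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, channels))
w = (rng.standard_normal((seq_len // 2 + 1, channels))
     + 1j * rng.standard_normal((seq_len // 2 + 1, channels)))
print(fourier_token_mixer(x, w).shape)     # (65536, 8)
```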
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- CETransformer: Casual Effect Estimation via Transformer Based Representation Learning [17.622007687796756]
Data-driven causal effect estimation faces two main challenges: selection bias and missing counterfactuals.
To address these two issues, most of the existing approaches tend to reduce the selection bias by learning a balanced representation.
We propose the CETransformer model for causal effect estimation via transformer-based representation learning.
arXiv Detail & Related papers (2021-07-19T09:39:57Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures across a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
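One plausible reading of such a scheme, sketched below: exponentiated scores are first normalized over queries, so every key distributes a unit of attention and cannot be fully "explained away", and then over keys, so each row is again a distribution. The exact normalization order is an assumption, not taken from the paper.

```python
import numpy as np

def doubly_normalized_attention(Q, K, V):
    """Two-step normalization of attention weights (illustrative sketch)."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max())        # stabilized exponentiated scores
    e = e / e.sum(axis=0, keepdims=True)     # normalize over queries, per key
    w = e / e.sum(axis=1, keepdims=True)     # normalize over keys, per query
    return w @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(doubly_normalized_attention(Q, K, V).shape)  # (16, 8)
```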
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those computed by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
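The verification algorithm itself is involved, but the naive Interval Bound Propagation baseline it is compared against is easy to sketch: propagate element-wise lower and upper bounds through a linear layer and a ReLU. This is generic IBP under an L-infinity perturbation, not the paper's method.

```python
import numpy as np

def ibp_linear(lo, hi, W, b):
    """Interval Bound Propagation through y = W x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius                # worst case over the input box
    return c - r, c + r

def ibp_relu(lo, hi):
    """ReLU is monotone, so bounds pass through element-wise."""
    return np.maximum(lo, 0), np.maximum(hi, 0)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
x = rng.standard_normal(3)
eps = 0.1                                 # L-infinity perturbation budget
lo, hi = ibp_relu(*ibp_linear(x - eps, x + eps, W, b))
print(lo <= hi)                           # bounds stay ordered: all True
```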
arXiv Detail & Related papers (2020-02-16T17:16:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.