Multi-manifold Attention for Vision Transformers
- URL: http://arxiv.org/abs/2207.08569v3
- Date: Tue, 5 Sep 2023 09:05:15 GMT
- Title: Multi-manifold Attention for Vision Transformers
- Authors: Dimitrios Konstantinidis, Ilias Papastratis, Kosmas Dimitropoulos,
Petros Daras
- Abstract summary: Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
- Score: 12.862540139118073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers are very popular nowadays due to their state-of-the-art
performance in several computer vision tasks, such as image classification and
action recognition. Although their performance has been greatly enhanced
through highly descriptive patch embeddings and hierarchical structures, there
is still limited research on utilizing additional data representations so as to
refine the self-attention map of a Transformer. To address this problem, a novel
attention mechanism, called multi-manifold multihead attention, is proposed in
this work to substitute the vanilla self-attention of a Transformer. The
proposed mechanism models the input space in three distinct manifolds, namely
Euclidean, Symmetric Positive Definite and Grassmann, thus leveraging different
statistical and geometrical properties of the input for the computation of a
highly descriptive attention map. In this way, the proposed attention mechanism
can guide a Vision Transformer to become more attentive towards important
appearance, color and texture features of an image, leading to improved
classification and segmentation results, as shown by the experimental results
on well-known datasets.
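As a rough illustration of the idea described in the abstract, the following NumPy sketch computes three attention maps: a Euclidean dot-product map, an SPD map based on the log-Euclidean distance between regularized token covariance matrices, and a Grassmann map based on the projection metric between token subspaces, then averages them. The token-to-SPD and token-to-subspace mappings and the simple averaging are assumptions made here for illustration; the paper's actual construction may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def euclidean_attention(Q, K):
    # standard scaled dot-product similarity
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def spd_attention(X, eps=1e-5):
    # represent each token by a small SPD matrix (regularized outer product)
    # and compare tokens with the log-Euclidean metric
    n, d = X.shape
    logs = []
    for x in X:
        S = np.outer(x, x) + eps * np.eye(d)       # SPD representation
        w, V = np.linalg.eigh(S)
        logs.append(V @ np.diag(np.log(w)) @ V.T)  # matrix logarithm
    logs = np.stack(logs).reshape(n, -1)
    dist = np.linalg.norm(logs[:, None, :] - logs[None, :, :], axis=-1)
    return softmax(-dist)  # closer on the manifold -> higher attention

def grassmann_attention(X, p=2):
    # represent each token by a p-dimensional subspace (via a hypothetical
    # reshape of the embedding, purely illustrative) and compare subspaces
    # with the projection metric d(A, B) = ||A A^T - B B^T||_F
    n, d = X.shape
    projs = []
    for x in X:
        M = x.reshape(p, d // p)                   # token-to-matrix map
        U, _, _ = np.linalg.svd(M.T, full_matrices=False)
        A = U[:, :p]                               # orthonormal subspace basis
        projs.append((A @ A.T).ravel())
    projs = np.stack(projs)
    dist = np.linalg.norm(projs[:, None, :] - projs[None, :, :], axis=-1)
    return softmax(-dist)

def multi_manifold_attention(X, Wq, Wk):
    # average the three row-stochastic attention maps; the paper may
    # combine them differently (e.g. a learned weighting per head)
    A_e = euclidean_attention(X @ Wq, X @ Wk)
    A_s = spd_attention(X)
    A_g = grassmann_attention(X)
    return (A_e + A_s + A_g) / 3.0

rng = np.random.default_rng(0)
n, d = 4, 8                                        # 4 tokens, 8-dim embeddings
X = rng.standard_normal((n, d))
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = multi_manifold_attention(X, Wq, Wk)
print(A.shape, np.allclose(A.sum(axis=1), 1.0))    # (4, 4) True
```

Since each of the three maps is row-stochastic, their average is a valid attention matrix that blends the three notions of token similarity.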
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
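The core idea of treating an attention-score matrix as a feature map can be sketched by running a plain 2-D convolution over the n x n scores. The smoothing kernel below is a hypothetical choice for illustration, not DAPE's actual operator.

```python
import numpy as np

def conv2d_same(A, kernel):
    # naive 'same' 2-D convolution with zero padding, enough for a sketch
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    P = np.pad(A, ((ph, ph), (pw, pw)))
    out = np.empty_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = np.sum(P[i:i + kh, j:j + kw] * kernel)
    return out

# treat an n x n attention-score matrix as a single-channel feature map
rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 6))
kernel = np.full((3, 3), 1.0 / 9.0)       # hypothetical smoothing kernel
processed = conv2d_same(scores, kernel)   # convolved scores, same shape
print(processed.shape)                    # (6, 6)
```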
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Vision Transformers for Action Recognition: A Survey [41.69370782177517]
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have proven the efficacy of transformers beyond the image domain to solve numerous video-related tasks.
Human action recognition is receiving special attention from the research community due to its widespread applications.
arXiv Detail & Related papers (2022-09-13T02:57:05Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in Vision Transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to produce high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate this issue.
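The aliasing effect described in this entry is the classic sampling phenomenon. A minimal one-dimensional demonstration (unrelated to ARM's actual design): a 7 Hz sine sampled at only 8 Hz is indistinguishable from a 1 Hz sine, because the frequency folds back across the Nyquist limit.

```python
import numpy as np

fs = 8                             # sampling rate, below Nyquist for 7 Hz
t = np.arange(0, 1, 1 / fs)        # one second of samples
high = np.sin(2 * np.pi * 7 * t)   # 7 Hz tone
low = -np.sin(2 * np.pi * 1 * t)   # its 1 Hz alias (7 = 8 - 1 folds back)
print(np.allclose(high, low))      # True
```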
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Evolving Attention with Residual Convolutions [29.305149185821882]
We propose a novel mechanism based on evolving attention to improve the performance of transformers.
The proposed attention mechanism achieves significant performance improvement over various state-of-the-art models for multiple tasks.
arXiv Detail & Related papers (2021-02-20T15:24:06Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.