Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention
- URL: http://arxiv.org/abs/2303.01542v1
- Date: Thu, 2 Mar 2023 19:18:11 GMT
- Title: Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention
- Authors: Paria Mehrani and John K. Tsotsos
- Abstract summary: We ask whether attention mechanisms in vision transformers exhibit effects similar to those known in human visual attention.
Our results suggest that self-attention modules group figures in the stimuli based on similarity in visual features such as color.
In a singleton detection experiment, we studied whether these models exhibit effects similar to those of the feed-forward visual salience mechanisms utilized in human visual attention.
- Score: 11.789983276366986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, a considerable number of studies in computer vision involve deep
neural architectures called vision transformers. Visual processing in these
models incorporates computational models that are claimed to implement
attention mechanisms. Despite an increasing body of work that attempts to
understand the role of attention mechanisms in vision transformers, their
effect is largely unknown. Here, we asked if the attention mechanisms in vision
transformers exhibit similar effects as those known in human visual attention.
To answer this question, we revisited the attention formulation in these models
and found that despite the name, computationally, these models perform a
special class of relaxation labeling with similarity grouping effects.
Additionally, whereas modern experimental findings reveal that human visual
attention involves both feed-forward and feedback mechanisms, the purely
feed-forward architecture of vision transformers suggests that attention in
these models will not have the same effects as those known in humans. To
quantify these observations, we evaluated grouping performance in a family of
vision transformers. Our results suggest that self-attention modules group
figures in the stimuli based on similarity in visual features such as color.
Also, in a singleton detection experiment as an instance of saliency detection,
we studied if these models exhibit similar effects as those of feed-forward
visual salience mechanisms utilized in human visual attention. We found that
generally, the transformer-based attention modules assign more salience either
to distractors or the ground. Together, our study suggests that the attention
mechanisms in vision transformers perform similarity grouping and not
attention.
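
To make the grouping interpretation above concrete, here is a minimal, hypothetical sketch (not the paper's evaluation code): single-head scaled dot-product self-attention, softmax(QK^T / sqrt(d)) V, computed over a few synthetic "patch" feature vectors. Because the attention weights come from query-key similarity, tokens whose features match (a stand-in for similar color) exchange most of their attention mass, which is the similarity-grouping behavior described in the abstract.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over token features x of shape (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
# Toy "patches": tokens 0-2 share one feature pattern (say, red), tokens 3-5 another (say, blue).
red, blue = rng.normal(size=d), rng.normal(size=d)
x = np.stack([red + 0.1 * rng.normal(size=d) for _ in range(3)] +
             [blue + 0.1 * rng.normal(size=d) for _ in range(3)])

# Identity projections keep the example transparent; real ViTs learn wq, wk, wv.
eye = np.eye(d)
_, attn = self_attention(x, eye, eye, eye)
print(np.round(attn, 2))   # attention mass concentrates within each similar-feature group
```

Printing the attention matrix shows each token spreading its weight almost entirely over the tokens in its own feature group, i.e., grouping by similarity rather than selecting a single task-relevant target.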
Related papers
- Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans [2.485182034310303]
We propose a model of human object-based attention spreading and segmentation.
Our work provides new benchmarks for evaluating models of visual representation learning including Transformers.
arXiv Detail & Related papers (2023-06-01T02:25:55Z)
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention (a toy sketch of this joint-embedding idea appears after this list).
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
- Multi-manifold Attention for Vision Transformers [12.862540139118073]
Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z)
- Deep Active Visual Attention for Real-time Robot Motion Generation: Emergence of Tool-body Assimilation and Adaptive Tool-use [9.141661467673817]
This paper proposes a novel robot motion generation model, inspired by a human cognitive structure.
The model incorporates a state-driven active top-down visual attention module, which acquires attentions that can actively change targets based on task states.
The results suggested improved flexibility in the model's visual perception, which sustained stable attention and motion even when the model was provided with untrained tools or exposed to the experimenter's distractions.
arXiv Detail & Related papers (2022-06-29T10:55:32Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Attention Mechanisms in Computer Vision: A Survey [75.6074182122423]
We provide a comprehensive review of various attention mechanisms in computer vision.
We categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.
We suggest future directions for attention mechanism research.
arXiv Detail & Related papers (2021-11-15T09:18:40Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions by any Transformer-based architecture.
Our method is superior to all existing methods that are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
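
As a rough, assumed illustration of the joint query-key embedding idea mentioned in the AttentionViz entry above (not the authors' tool; the random queries and keys and the output file name are placeholders), the sketch below stacks the query and key vectors of one attention head into a single matrix, projects them to 2D with PCA, and plots both point clouds in a shared coordinate system.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_tokens, d_head = 64, 32
queries = rng.normal(size=(n_tokens, d_head))   # placeholders; in practice, take Q and K
keys = rng.normal(size=(n_tokens, d_head))      # from one head of a trained transformer

joint = np.concatenate([queries, keys], axis=0)            # one shared embedding space
centered = joint - joint.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)    # PCA via SVD
coords = centered @ vt[:2].T                               # top-2 principal components

plt.scatter(coords[:n_tokens, 0], coords[:n_tokens, 1], s=12, label="queries")
plt.scatter(coords[n_tokens:, 0], coords[n_tokens:, 1], s=12, marker="x", label="keys")
plt.legend()
plt.savefig("qk_joint_embedding.png")   # hypothetical output path
```

With query and key vectors taken from a trained model instead of random placeholders, proximity between the two point clouds in such a plot roughly indicates which keys a head's queries tend to select.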