Visualizing and Understanding Patch Interactions in Vision Transformer
- URL: http://arxiv.org/abs/2203.05922v1
- Date: Fri, 11 Mar 2022 13:48:11 GMT
- Title: Visualizing and Understanding Patch Interactions in Vision Transformer
- Authors: Jie Ma, Yalong Bai, Bineng Zhong, Wei Zhang, Ting Yao, Tao Mei
- Abstract summary: Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in the vision transformer.
- Score: 96.70401478061076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has become a leading tool in various computer vision
tasks, owing to its unique self-attention mechanism that learns visual
representations explicitly through cross-patch information interactions.
Despite this success, the literature seldom explores the explainability of
the vision transformer, and there is no clear picture of how the attention
mechanism behaves with respect to correlations across all patches, how it
impacts performance, or what further potential it holds. In this work, we
propose a novel explainable visualization approach to analyze and interpret
the crucial attention interactions among patches in the vision transformer.
Specifically, we first introduce a quantification indicator to measure the
impact of patch interaction, and verify this quantification on attention
window design and the removal of indiscriminative patches. We then exploit
the effective responsive field of each patch in ViT and devise a
window-free transformer architecture accordingly. Extensive experiments on
ImageNet demonstrate that the proposed quantitative method facilitates ViT
model learning, improving top-1 accuracy by up to 4.28%. Moreover, results
on downstream fine-grained recognition tasks further validate the
generalizability of our approach.
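The paper's exact indicator is not reproduced here; as a rough illustration, the sketch below shows one plausible way to quantify patch interactions from attention maps and derive an effective responsive field per patch. The head-averaging, symmetrization, and thresholding rule are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def patch_interaction_scores(attn: np.ndarray) -> np.ndarray:
    """Aggregate multi-head attention into a symmetric patch-interaction matrix.

    attn: attention weights of shape (num_heads, N, N), rows softmax-normalized.
    Returns an (N, N) matrix where entry (i, j) scores how strongly
    patches i and j interact, averaged over heads and symmetrized.
    """
    mean_attn = attn.mean(axis=0)            # average over heads -> (N, N)
    return 0.5 * (mean_attn + mean_attn.T)   # symmetrize: interaction is mutual

def effective_responsive_field(scores: np.ndarray, ratio: float = 0.5) -> list:
    """For each patch, keep peers whose interaction score exceeds a fraction
    `ratio` of that patch's strongest interaction (a hypothetical threshold,
    used here only for illustration)."""
    fields = []
    for i in range(scores.shape[0]):
        row = scores[i].copy()
        row[i] = 0.0                         # ignore self-interaction
        fields.append(np.where(row >= ratio * row.max())[0])
    return fields

# Toy usage: 4 heads, 196 patches (14x14 grid, 224x224 image, patch size 16)
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 196, 196))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row softmax
fields = effective_responsive_field(patch_interaction_scores(attn))
print(len(fields[0]), "patches in the responsive field of patch 0")
```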
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
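A minimal sketch of the DAPE V2 idea as summarized above: treat the (heads x query x key) attention-score tensor as a multi-channel feature map and run a convolution over it before the softmax. The kernel size and the exact placement of the convolution are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class ConvProcessedAttention(nn.Module):
    """Process raw attention scores with a 2D convolution, treating the
    (heads, N, N) score tensor as a feature map (a sketch, not DAPE V2's
    exact design)."""

    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        # Head dimension plays the role of channels; padding keeps N x N shape.
        self.conv = nn.Conv2d(num_heads, num_heads,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, N, N) pre-softmax attention logits
        refined = self.conv(scores)
        return refined.softmax(dim=-1)

# Toy usage
attn = ConvProcessedAttention(num_heads=8)(torch.randn(2, 8, 196, 196))
print(attn.shape)  # torch.Size([2, 8, 196, 196])
```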
- Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer [16.97186100288621]
Vision Transformers extract visual information by representing regions as transformed tokens and integrating them via attention weights.
Existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens.
We propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects.
arXiv Detail & Related papers (2024-03-21T16:52:27Z)
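TokenTM's exact measurement is not given in the summary above; the sketch below only illustrates the general idea of combining attention weights with a measure of how much each token was transformed. The choice of the transformed-token norm as that measure, and the [CLS]-row convention, are hypothetical.

```python
import torch

def token_attribution(attn: torch.Tensor, transformed: torch.Tensor) -> torch.Tensor:
    """Weight attention by token transformation magnitude.

    attn:        (heads, N, N) attention weights.
    transformed: (N, d) tokens after the layer's value/output transformation.
    Returns an (N,) relevance score for each token with respect to the
    [CLS] token at index 0 (an illustrative convention, not TokenTM's).
    """
    strength = transformed.norm(dim=-1)   # (N,) how much each token "carries"
    mean_attn = attn.mean(dim=0)          # (N, N) head-averaged attention
    cls_row = mean_attn[0]                # attention from [CLS] to all tokens
    scores = cls_row * strength           # modulate attention by transformation
    return scores / scores.sum()          # normalize to a distribution

# Toy usage: 12 heads, 197 tokens ([CLS] + 196 patches), dim 768
relevance = token_attribution(torch.rand(12, 197, 197), torch.randn(197, 768))
print(relevance.shape)  # torch.Size([197])
```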
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
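A minimal sketch of the joint query-key embedding idea behind AttentionViz: pool the query and key vectors of one head into a single point set and project them into 2D. Plain PCA via SVD is used here as a stand-in; the actual tool adds normalization tricks and an interactive interface.

```python
import numpy as np

def joint_query_key_embedding(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Project queries and keys of one attention head into a shared 2D space.

    q, k: (N, d) query and key vectors for N tokens.
    Returns (2N, 2) coordinates; the first N rows are queries, the rest keys.
    PCA is an assumed stand-in for the paper's dimensionality reduction.
    """
    joint = np.vstack([q, k])            # (2N, d) one combined point set
    joint = joint - joint.mean(axis=0)   # center before PCA
    _, _, vt = np.linalg.svd(joint, full_matrices=False)
    return joint @ vt[:2].T              # coordinates on the top-2 components

# Toy usage: 196 tokens, 64-dim head; q/k would come from a real ViT layer
rng = np.random.default_rng(0)
coords = joint_query_key_embedding(rng.normal(size=(196, 64)),
                                   rng.normal(size=(196, 64)))
print(coords.shape)  # (392, 2) -- plot queries and keys in two colors
```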
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
A regularization objective, TokenProp, is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Multi-manifold Attention for Vision Transformers [12.862540139118073]
Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
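The summary above describes AdaViT only at a high level; the sketch below shows one hypothetical form of a learned usage policy: a tiny head predicting per-patch keep probabilities with a straight-through hard decision. The gating scheme and all names are assumptions; head- and block-level policies would be analogous.

```python
import torch
import torch.nn as nn

class PatchUsagePolicy(nn.Module):
    """Predict which patch tokens to keep (a simplified stand-in for
    AdaViT's patch-selection policy, not its actual architecture)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one keep/drop logit per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, N, dim)
        probs = torch.sigmoid(self.gate(tokens)).squeeze(-1)  # (batch, N)
        hard = (probs > 0.5).float()
        # Straight-through estimator: hard 0/1 mask forward, soft gradient back.
        mask = hard + probs - probs.detach()
        return tokens * mask.unsqueeze(-1)  # dropped patches are zeroed out

# Toy usage
x = torch.randn(2, 196, 384)
print(PatchUsagePolicy(384)(x).shape)  # torch.Size([2, 196, 384])
```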
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to produce high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
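The summary does not specify ARM's internals; as a rough illustration of aliasing reduction, the sketch below low-pass filters patch-token feature maps with a fixed depthwise blur before they re-enter the transformer. The averaging filter and the module name are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class AliasingReductionSketch(nn.Module):
    """Smooth patch-token feature maps with a fixed depthwise 3x3 blur
    (an illustrative low-pass filter, not the paper's exact ARM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.blur = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                              groups=dim, bias=False)
        # Fixed averaging kernel: each channel is smoothed independently.
        with torch.no_grad():
            self.blur.weight.fill_(1.0 / 9.0)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (batch, N, dim) with N == grid * grid patch tokens
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, grid, grid)
        fmap = self.blur(fmap)  # low-pass filter the 2D token grid
        return fmap.reshape(b, d, n).transpose(1, 2)

# Toy usage: 14x14 patch grid
x = torch.randn(2, 196, 384)
print(AliasingReductionSketch(384)(x, grid=14).shape)  # torch.Size([2, 196, 384])
```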