Visualizing and Understanding Patch Interactions in Vision Transformer
- URL: http://arxiv.org/abs/2203.05922v1
- Date: Fri, 11 Mar 2022 13:48:11 GMT
- Title: Visualizing and Understanding Patch Interactions in Vision Transformer
- Authors: Jie Ma, Yalong Bai, Bineng Zhong, Wei Zhang, Ting Yao, Tao Mei
- Abstract summary: Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in the vision transformer.
- Score: 96.70401478061076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has become a leading tool in various computer vision
tasks, owing to its unique self-attention mechanism that learns visual
representations explicitly through cross-patch information interactions.
Despite this success, the literature seldom explores the explainability of
the vision transformer, and there is no clear picture of how the attention
mechanism behaves with respect to correlations across all patches, how it
impacts performance, or what further potential it holds. In this work, we
propose a novel explainable visualization approach to analyze and interpret
the crucial attention interactions among patches in the vision transformer.
Specifically, we first introduce a quantification indicator to measure the
impact of patch interaction, and verify this quantification on attention
window design and the removal of indiscriminative patches. We then exploit
the effective responsive field of each patch in ViT and devise a
window-free transformer architecture accordingly. Extensive experiments on
ImageNet demonstrate that the proposed quantitative method facilitates ViT
model learning, improving top-1 accuracy by up to 4.28%. Moreover, results
on downstream fine-grained recognition tasks further validate the
generalizability of our approach.
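The paper's exact indicator is not reproduced here; as a rough illustration, the sketch below shows one plausible way to quantify patch interactions from attention maps and derive an effective responsive field per patch. The head-averaging, symmetrization, and thresholding rule are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def patch_interaction_scores(attn: np.ndarray) -> np.ndarray:
    """Aggregate multi-head attention into a symmetric patch-interaction matrix.

    attn: attention weights of shape (num_heads, N, N), rows softmax-normalized.
    Returns an (N, N) matrix where entry (i, j) scores how strongly
    patches i and j interact, averaged over heads and symmetrized.
    """
    mean_attn = attn.mean(axis=0)            # average over heads -> (N, N)
    return 0.5 * (mean_attn + mean_attn.T)   # symmetrize: interaction is mutual

def effective_responsive_field(scores: np.ndarray, ratio: float = 0.5) -> list:
    """For each patch, keep peers whose interaction score exceeds a fraction
    `ratio` of that patch's strongest interaction (a hypothetical threshold,
    used here only for illustration)."""
    fields = []
    for i in range(scores.shape[0]):
        row = scores[i].copy()
        row[i] = 0.0                         # ignore self-interaction
        fields.append(np.where(row >= ratio * row.max())[0])
    return fields

# Toy usage: 4 heads, 196 patches (14x14 grid, 224x224 image, patch size 16)
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 196, 196))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row softmax
fields = effective_responsive_field(patch_interaction_scores(attn))
print(len(fields[0]), "patches in the responsive field of patch 0")
```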
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
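A minimal sketch of the DAPE V2 idea as summarized above: treat the (heads x query x key) attention-score tensor as a multi-channel feature map and run a convolution over it before the softmax. The kernel size and the exact placement of the convolution are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class ConvProcessedAttention(nn.Module):
    """Process raw attention scores with a 2D convolution, treating the
    (heads, N, N) score tensor as a feature map (a sketch, not DAPE V2's
    exact design)."""

    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        # Head dimension plays the role of channels; padding keeps N x N shape.
        self.conv = nn.Conv2d(num_heads, num_heads,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, N, N) pre-softmax attention logits
        refined = self.conv(scores)
        return refined.softmax(dim=-1)

# Toy usage
attn = ConvProcessedAttention(num_heads=8)(torch.randn(2, 8, 196, 196))
print(attn.shape)  # torch.Size([2, 8, 196, 196])
```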
- Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer [16.97186100288621]
Vision Transformers extract visual information by representing regions as transformed tokens and integrating them via attention weights.
Existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens.
We propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects.
arXiv Detail & Related papers (2024-03-21T16:52:27Z)
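TokenTM's exact measurement is not given in the summary above; the sketch below only illustrates the general idea of combining attention weights with a measure of how much each token was transformed. The choice of the transformed-token norm as that measure, and the [CLS]-row convention, are hypothetical.

```python
import torch

def token_attribution(attn: torch.Tensor, transformed: torch.Tensor) -> torch.Tensor:
    """Weight attention by token transformation magnitude.

    attn:        (heads, N, N) attention weights.
    transformed: (N, d) tokens after the layer's value/output transformation.
    Returns an (N,) relevance score for each token with respect to the
    [CLS] token at index 0 (an illustrative convention, not TokenTM's).
    """
    strength = transformed.norm(dim=-1)   # (N,) how much each token "carries"
    mean_attn = attn.mean(dim=0)          # (N, N) head-averaged attention
    cls_row = mean_attn[0]                # attention from [CLS] to all tokens
    scores = cls_row * strength           # modulate attention by transformation
    return scores / scores.sum()          # normalize to a distribution

# Toy usage: 12 heads, 197 tokens ([CLS] + 196 patches), dim 768
relevance = token_attribution(torch.rand(12, 197, 197), torch.randn(197, 768))
print(relevance.shape)  # torch.Size([197])
```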
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
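A minimal sketch of the joint query-key embedding idea behind AttentionViz: pool the query and key vectors of one head into a single point set and project them into 2D. Plain PCA via SVD is used here as a stand-in; the actual tool adds normalization tricks and an interactive interface.

```python
import numpy as np

def joint_query_key_embedding(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Project queries and keys of one attention head into a shared 2D space.

    q, k: (N, d) query and key vectors for N tokens.
    Returns (2N, 2) coordinates; the first N rows are queries, the rest keys.
    PCA is an assumed stand-in for the paper's dimensionality reduction.
    """
    joint = np.vstack([q, k])            # (2N, d) one combined point set
    joint = joint - joint.mean(axis=0)   # center before PCA
    _, _, vt = np.linalg.svd(joint, full_matrices=False)
    return joint @ vt[:2].T              # coordinates on the top-2 components

# Toy usage: 196 tokens, 64-dim head; q/k would come from a real ViT layer
rng = np.random.default_rng(0)
coords = joint_query_key_embedding(rng.normal(size=(196, 64)),
                                   rng.normal(size=(196, 64)))
print(coords.shape)  # (392, 2) -- plot queries and keys in two colors
```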
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
A regularization objective, TokenProp, is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing.
Many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks.
This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
- Multi-manifold Attention for Vision Transformers [12.862540139118073]
Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
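The summary above describes AdaViT only at a high level; the sketch below shows one hypothetical form of a learned usage policy: a tiny head predicting per-patch keep probabilities with a straight-through hard decision. The gating scheme and all names are assumptions; head- and block-level policies would be analogous.

```python
import torch
import torch.nn as nn

class PatchUsagePolicy(nn.Module):
    """Predict which patch tokens to keep (a simplified stand-in for
    AdaViT's patch-selection policy, not its actual architecture)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one keep/drop logit per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, N, dim)
        probs = torch.sigmoid(self.gate(tokens)).squeeze(-1)  # (batch, N)
        hard = (probs > 0.5).float()
        # Straight-through estimator: hard 0/1 mask forward, soft gradient back.
        mask = hard + probs - probs.detach()
        return tokens * mask.unsqueeze(-1)  # dropped patches are zeroed out

# Toy usage
x = torch.randn(2, 196, 384)
print(PatchUsagePolicy(384)(x).shape)  # torch.Size([2, 196, 384])
```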
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to produce high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
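The summary does not specify ARM's internals; as a rough illustration of aliasing reduction, the sketch below low-pass filters patch-token feature maps with a fixed depthwise blur before they re-enter the transformer. The averaging filter and the module name are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class AliasingReductionSketch(nn.Module):
    """Smooth patch-token feature maps with a fixed depthwise 3x3 blur
    (an illustrative low-pass filter, not the paper's exact ARM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.blur = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                              groups=dim, bias=False)
        # Fixed averaging kernel: each channel is smoothed independently.
        with torch.no_grad():
            self.blur.weight.fill_(1.0 / 9.0)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (batch, N, dim) with N == grid * grid patch tokens
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, grid, grid)
        fmap = self.blur(fmap)  # low-pass filter the 2D token grid
        return fmap.reshape(b, d, n).transpose(1, 2)

# Toy usage: 14x14 patch grid
x = torch.randn(2, 196, 384)
print(AliasingReductionSketch(384)(x, grid=14).shape)  # torch.Size([2, 196, 384])
```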