Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- URL: http://arxiv.org/abs/2403.14552v1
- Date: Thu, 21 Mar 2024 16:52:27 GMT
- Title: Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- Authors: Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan
- Abstract summary: Vision Transformers extract visual information by representing regions as transformed tokens and integrating them via attention weights.
Existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens.
We propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects.
- Score: 16.97186100288621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformers have rapidly gained popularity in various computer vision applications, post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However, existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens, which fails to accurately illustrate the rationales behind the models' predictions. To incorporate the influence of token transformation into interpretation, we propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects. Specifically, we quantify token transformation effects by measuring changes in token lengths and correlations in their directions pre- and post-transformation. Moreover, we develop initialization and aggregation rules to integrate both attention weights and token transformation effects across all layers, capturing holistic token contributions throughout the model. Experimental results on segmentation and perturbation tests demonstrate the superiority of our proposed TokenTM compared to state-of-the-art Vision Transformer explanation methods.
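The two quantities described in the abstract (the change in token length and the directional correlation between pre- and post-transformation tokens) can be sketched concretely. The snippet below is a minimal, hypothetical PyTorch illustration; the function names, the product of length ratio and cosine similarity, and the rollout-style layer aggregation are assumptions made for illustration, not the authors' exact formulation.

```python
import torch

def token_transformation_scores(x_pre, x_post):
    """Hypothetical per-token transformation effect.

    x_pre, x_post: (batch, tokens, dim) token states before and after a block.
    Combines (i) the change in token length (L2 norm) and (ii) the directional
    correlation (cosine similarity) between pre- and post-transformation tokens.
    """
    length_change = x_post.norm(dim=-1) / (x_pre.norm(dim=-1) + 1e-6)    # (B, N)
    direction_corr = torch.cosine_similarity(x_pre, x_post, dim=-1)      # (B, N)
    return length_change * direction_corr  # assumed combination, illustration only

def aggregate_layers(attn_maps, transform_scores):
    """Rollout-style aggregation across layers (an assumed rule, not the paper's exact one).

    attn_maps: list of (B, N, N) attention matrices, averaged over heads.
    transform_scores: list of (B, N) per-token transformation effects.
    """
    B, N, _ = attn_maps[0].shape
    device = attn_maps[0].device
    relevance = torch.eye(N, device=device).expand(B, N, N).clone()  # identity init
    for attn, t in zip(attn_maps, transform_scores):
        weighted = attn * t.unsqueeze(1)            # scale attention to each source token
        layer_map = torch.eye(N, device=device) + weighted  # keep the residual path
        layer_map = layer_map / layer_map.sum(dim=-1, keepdim=True)
        relevance = layer_map @ relevance           # propagate contributions across layers
    return relevance                                # (B, N, N) holistic token contributions
```

Under these assumptions, the row of the resulting relevance matrix corresponding to the class token would be reshaped into a patch-level heatmap for a classification ViT.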
Related papers
- Inspecting Explainability of Transformer Models with Additional Statistical Information [27.04589064942369]
The method of Chefer et al. effectively visualizes Transformers on vision and multi-modal tasks by combining attention layers to show the importance of each image patch.
However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object.
Our method, which considers the statistics of tokens in layer normalization layers, effectively explains both the Swin Transformer and ViT.
arXiv Detail & Related papers (2023-11-19T17:22:50Z)
- ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z)
- VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
arXiv Detail & Related papers (2023-05-22T19:04:56Z)
- ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in Vision Transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)