Disentangling Visual Transformers: Patch-level Interpretability for Image Classification
- URL: http://arxiv.org/abs/2502.17196v2
- Date: Thu, 24 Apr 2025 12:21:25 GMT
- Title: Disentangling Visual Transformers: Patch-level Interpretability for Image Classification
- Authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie
- Abstract summary: We propose the Hindered Transformer (HiT), a novel interpretable-by-design architecture inspired by visual transformers. HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance.
- Score: 2.899118947717404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose the Hindered Transformer (HiT), a novel interpretable-by-design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.
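The key property claimed in the abstract, a classifier whose logits decompose into a linear combination of patch-level information, can be illustrated with a toy sketch. This is not the actual HiT architecture (the paper's specific design is not reproduced here); it is a minimal, hypothetical example of why such a decomposition makes per-patch attribution trivial: each patch's contribution to every class score can be read off directly, with no gradients or attention rollout needed.

```python
import numpy as np

# Toy sketch (NOT the HiT implementation): logits formed as an explicit
# sum of per-patch linear contributions, so attribution is exact by design.
rng = np.random.default_rng(0)

num_patches, feat_dim, num_classes = 196, 64, 10
patch_features = rng.standard_normal((num_patches, feat_dim))  # one row per patch
W = rng.standard_normal((feat_dim, num_classes))               # shared linear head

# Each row is one patch's additive contribution to every class logit.
contributions = patch_features @ W            # shape (num_patches, num_classes)
logits = contributions.sum(axis=0)            # class scores

# The attribution map for the predicted class is simply that class's
# column of `contributions`: one scalar per patch, summing to the logit.
pred = int(np.argmax(logits))
attribution = contributions[:, pred]

# Sanity check: the per-patch contributions decompose the logits exactly.
assert np.allclose(contributions.sum(axis=0), logits)
```

Contrast this with a standard ViT, where self-attention mixes patch information across layers, so no such exact additive decomposition of the final logits exists.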
Related papers
- SwinStyleformer is a favorable choice for image inversion [2.8115030277940947]
This paper proposes SwinStyleformer, the first pure-Transformer inversion network.
Experiments found that the inversion network with the Transformer backbone could not successfully invert the image.
arXiv Detail & Related papers (2024-06-19T02:08:45Z) - Inspecting Explainability of Transformer Models with Additional Statistical Information [27.04589064942369]
The method of Chefer et al. effectively visualizes Transformers on vision and multi-modal tasks by combining attention layers to show the importance of each image patch.
However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object.
Our method, by considering the statistics of tokens in layer normalization layers, shows a great ability to interpret both the Swin Transformer and ViT.
arXiv Detail & Related papers (2023-11-19T17:22:50Z) - Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z) - What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
A regularization objective, TokenProp, is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z) - Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z) - Exploring and Improving Mobile Level Vision Transformers [81.7741384218121]
We study vision transformer structures at the mobile level in this paper, and find a dramatic performance drop.
We propose a novel irregular patch embedding module and adaptive patch fusion module to improve the performance.
arXiv Detail & Related papers (2021-08-30T06:42:49Z) - Training Vision Transformers for Image Retrieval [32.09708181236154]
We adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective.
Our results show consistent and significant improvements of transformers over convolution-based approaches.
arXiv Detail & Related papers (2021-02-10T18:56:41Z) - A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.