Interpretability-Aware Vision Transformer
- URL: http://arxiv.org/abs/2309.08035v1
- Date: Thu, 14 Sep 2023 21:50:49 GMT
- Title: Interpretability-Aware Vision Transformer
- Authors: Yao Qiang, Chengyin Li, Prashant Khanduri and Dongxiao Zhu
- Abstract summary: Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
- Score: 13.310757078491916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have become prominent models for solving various
vision tasks. However, the interpretability of ViTs has not kept pace with
their promising performance. While there has been a surge of interest in
developing post hoc solutions to explain ViTs' outputs, these methods do
not generalize to different downstream tasks and various transformer
architectures. Furthermore, if ViTs are not properly trained with the given
data and do not prioritize the region of interest, post hoc methods
would be less effective. Instead of developing another post hoc approach,
we introduce a novel training procedure that inherently enhances model
interpretability. Our interpretability-aware ViT (IA-ViT) draws inspiration
from a fresh insight: both the class patch and image patches consistently
generate predicted distributions and attention maps. IA-ViT is composed of a
feature extractor, a predictor, and an interpreter, which are trained jointly
with an interpretability-aware training objective. Consequently, the
interpreter simulates the behavior of the predictor and provides a faithful
explanation through its single-head self-attention mechanism. Our comprehensive
experimental results demonstrate the effectiveness of IA-ViT in several image
classification tasks, with both qualitative and quantitative evaluations of
model performance and interpretability. Source code is available from:
https://github.com/qiangyao1988/IA-ViT.
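The abstract describes IA-ViT as a feature extractor, a predictor, and an interpreter trained jointly, with the interpreter simulating the predictor through single-head self-attention. The sketch below is a minimal PyTorch rendering of that layout; the class names, dimensions, and the specific joint objective (cross-entropy plus a KL term tying the interpreter to the predictor) are illustrative assumptions, not the authors' implementation, which lives in the linked repository.
```python
# Minimal sketch of the IA-ViT layout described in the abstract.
# Module names, sizes, and the exact loss terms are assumptions for
# illustration; see https://github.com/qiangyao1988/IA-ViT for the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IAViT(nn.Module):
    def __init__(self, num_classes=10, dim=192, depth=6, heads=3,
                 image_size=32, patch_size=4):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Feature extractor: patch embedding + transformer encoder.
        self.patch_embed = nn.Conv2d(3, dim, patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Predictor: classification head on the class token.
        self.predictor = nn.Linear(dim, num_classes)
        # Interpreter: single-head self-attention over patch features,
        # followed by its own classification head.
        self.interp_attn = nn.MultiheadAttention(dim, num_heads=1,
                                                 batch_first=True)
        self.interp_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos_embed)
        logits = self.predictor(h[:, 0])                          # class token
        # The interpreter attends from the class token to the image patches;
        # attn_map is the single-head attention used as the explanation.
        ctx, attn_map = self.interp_attn(h[:, :1], h[:, 1:], h[:, 1:])
        interp_logits = self.interp_head(ctx.squeeze(1))
        return logits, interp_logits, attn_map.squeeze(1)


def ia_vit_loss(logits, interp_logits, targets, beta=1.0):
    """Assumed joint objective: supervised loss for the predictor plus a
    KL term that makes the interpreter simulate the predictor."""
    ce = F.cross_entropy(logits, targets)
    kl = F.kl_div(F.log_softmax(interp_logits, dim=-1),
                  F.softmax(logits.detach(), dim=-1), reduction="batchmean")
    return ce + beta * kl
```
At inference time, the returned attn_map over the image patches would be reshaped to the patch grid and visualized as the explanation for the prediction.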
Related papers
- Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z)
- Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides detailed high-level semantic explanations with strong localization performance using only class labels.
arXiv Detail & Related papers (2024-02-07T03:43:56Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [7.58745191859815]
Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training.
The results show that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets (a minimal sketch of such a bias appears after this list).
arXiv Detail & Related papers (2023-05-08T14:12:25Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
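The Gaussian attention bias entry above describes guiding attention toward a locality pattern from the very start of training. The sketch below shows one plausible way to add such a bias to the attention logits of a ViT; the parameterization (a fixed bias of -||p_i - p_j||^2 / (2 * sigma^2) over patch-grid coordinates) and where the bias enters the model are assumptions for illustration, not that paper's exact formulation.
```python
# Hedged sketch of a distance-based Gaussian attention bias for a ViT.
# The bias form and its placement are illustrative assumptions.
import torch
import torch.nn.functional as F


def gaussian_attention_bias(grid_size: int, sigma: float) -> torch.Tensor:
    """Pairwise bias over a grid_size x grid_size patch grid."""
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size),
                            indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    sq_dist = torch.cdist(coords, coords).pow(2)                        # (N, N)
    return -sq_dist / (2.0 * sigma ** 2)


def biased_attention(q, k, v, bias):
    """Single-head attention with the Gaussian bias added to the logits."""
    scale = q.size(-1) ** -0.5
    logits = q @ k.transpose(-2, -1) * scale + bias   # (B, N, N)
    return F.softmax(logits, dim=-1) @ v


if __name__ == "__main__":
    B, N, D = 2, 49, 64                       # 7x7 patch grid, 64-dim tokens
    bias = gaussian_attention_bias(grid_size=7, sigma=2.0)
    q = k = v = torch.randn(B, N, D)
    out = biased_attention(q, k, v, bias)
    print(out.shape)                          # torch.Size([2, 49, 64])
```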
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.