ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for
Vision Transformer
- URL: http://arxiv.org/abs/2310.02588v1
- Date: Wed, 4 Oct 2023 05:09:50 GMT
- Title: ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for
Vision Transformer
- Authors: Seok-Yong Byun, Wonju Lee
- Abstract summary: Vision Transformers (ViT) have demonstrated superior performance in various computer vision tasks such as image classification and object detection.
Current state-of-the-art solutions for ViT rely on class agnostic Attention-Rollout and Relevance techniques.
We propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which requires neither the attention matrix nor gradient information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a novel approach to address the challenges of
understanding the prediction process and debugging prediction errors in Vision
Transformers (ViT), which have demonstrated superior performance in various
computer vision tasks such as image classification and object detection. While
several visual explainability techniques, such as CAM, Grad-CAM, Score-CAM, and
Recipro-CAM, have been extensively researched for Convolutional Neural Networks
(CNNs), limited research has been conducted on ViT. Current state-of-the-art
solutions for ViT rely on class agnostic Attention-Rollout and Relevance
techniques. In this work, we propose a new gradient-free visual explanation
method for ViT, called ViT-ReciproCAM, which requires neither attention
matrices nor gradient information. ViT-ReciproCAM uses token masking to
generate new layer outputs from the target layer's input and exploits the
correlation between activated tokens and network predictions for target
classes. Our
proposed method outperforms the state-of-the-art Relevance method in the
Average Drop-Coherence-Complexity (ADCC) metric by $4.58\%$ to $5.80\%$ and
generates more localized saliency maps. Our experiments demonstrate the
effectiveness of ViT-ReciproCAM and showcase its potential for understanding
and debugging ViT models. Our proposed method provides an efficient and
easy-to-implement alternative for generating visual explanations, without
requiring attention and gradient information, which can be beneficial for
various applications in the field of computer vision.
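For intuition, here is a minimal sketch of the token-masking idea underlying this kind of gradient- and attention-free explanation: capture the input of a target transformer block, re-run the remaining blocks once per spatial token while the other patch tokens are zeroed out, and use the resulting target-class scores as a saliency map. This is an illustrative simplification, not the paper's exact algorithm; it assumes a timm-style VisionTransformer exposing `blocks`, `norm`, and `head`, a single CLS prefix token, and plain single-token masking, whereas ViT-ReciproCAM's masking scheme, batching, and layer reuse may differ.

```python
# Hedged sketch of gradient/attention-free saliency via token masking.
# Assumes a timm-style VisionTransformer (attributes: blocks, norm, head)
# with one CLS token followed by N patch tokens; ViT-ReciproCAM's actual
# masking and batching may differ -- see the paper for details.
import torch


@torch.no_grad()
def token_masking_saliency(model, image, class_idx, layer_idx=-1):
    """Saliency map from class scores under per-token masking."""
    model.eval()

    # 1) Capture the input of the chosen transformer block with a pre-hook.
    captured = {}

    def pre_hook(module, inputs):
        captured["x"] = inputs[0].detach()

    handle = model.blocks[layer_idx].register_forward_pre_hook(pre_hook)
    _ = model(image)                      # one ordinary forward pass
    handle.remove()

    x = captured["x"]                     # shape (1, 1 + N, D)
    num_patches = x.shape[1] - 1
    side = int(num_patches ** 0.5)        # assumes a square patch grid

    # 2) For each spatial token, keep only CLS + that token, re-run the
    #    remaining blocks and the head, and record the target-class score.
    #    (A batched version is straightforward; a loop keeps it readable.)
    tail = model.blocks[layer_idx:]
    scores = torch.zeros(num_patches)
    for i in range(num_patches):
        masked = torch.zeros_like(x)
        masked[:, 0] = x[:, 0]            # keep the CLS token
        masked[:, i + 1] = x[:, i + 1]    # keep the i-th patch token
        h = masked
        for blk in tail:
            h = blk(h)
        h = model.norm(h)
        logits = model.head(h[:, 0])      # classify from the CLS token
        scores[i] = logits[0, class_idx].item()

    # 3) Reshape token scores to a 2D map and normalize to [0, 1].
    cam = scores.reshape(side, side)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

The resulting low-resolution map (e.g. 14x14 for a 224x224 input with 16x16 patches) can be upsampled with torch.nn.functional.interpolate and overlaid on the image for inspection.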
Related papers
- LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z) - Attention Guided CAM: Visual Explanations of Vision Transformer Guided
by Self-Attention [2.466595763108917]
We propose an attention-guided visualization method applied to ViT that provides a high-level semantic explanation for its decision.
Our method provides detailed high-level semantic explanations with strong localization performance using only class labels.
arXiv Detail & Related papers (2024-02-07T03:43:56Z) - Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT)
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z) - Explainable Multi-Camera 3D Object Detection with Transformer-Based
Saliency Maps [0.0]
Vision Transformers (ViTs) have achieved state-of-the-art results on various computer vision tasks, including 3D object detection.
Their end-to-end design makes ViTs less explainable, which can be a challenge when deploying them in safety-critical applications.
We propose a novel method for generating saliency maps for a DETR-like ViT with multiple camera inputs used for 3D object detection.
arXiv Detail & Related papers (2023-12-22T11:03:12Z) - Vision Transformers Need Registers [26.63912173005165]
We identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks.
We show that supplying additional register tokens to the input sequence fixes this problem entirely for both supervised and self-supervised models.
arXiv Detail & Related papers (2023-09-28T16:45:46Z) - What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework that selects clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)