VISION DIFFMASK: Faithful Interpretation of Vision Transformers with
Differentiable Patch Masking
- URL: http://arxiv.org/abs/2304.06391v1
- Date: Thu, 13 Apr 2023 10:49:26 GMT
- Authors: Angelos Nalmpantis, Apostolos Panagiotopoulos, John Gkountouras,
Konstantinos Papakostas and Wilker Aziz
- Abstract summary: We propose a post-hoc interpretability method called VISION DIFFMASK.
It uses the activations of the model's hidden layers to predict the relevant parts of the input that contribute to its final predictions.
Our approach uses a gating mechanism to identify the minimal subset of the original input that preserves the predicted distribution over classes.
- Score: 10.345616883018296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack of interpretability of the Vision Transformer may hinder its use in
critical real-world applications despite its effectiveness. To overcome this
issue, we propose a post-hoc interpretability method called VISION DIFFMASK,
which uses the activations of the model's hidden layers to predict the relevant
parts of the input that contribute to its final predictions. Our approach uses
a gating mechanism to identify the minimal subset of the original input that
preserves the predicted distribution over classes. We demonstrate the
faithfulness of our method by introducing a faithfulness task and comparing
it to other state-of-the-art attribution methods on CIFAR-10 and ImageNet-1K,
achieving compelling results. To aid reproducibility and further extension of
our work, we open source our implementation:
https://github.com/AngelosNal/Vision-DiffMask
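The gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy linear classifier, the soft gates in [0, 1], the zero baseline, and the `sparsity_weight` penalty are all assumptions chosen for illustration. It only shows the shape of the objective: keep as few patches as possible while the predicted class distribution stays close to the one obtained from the full input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_classifier(patches, weights):
    """Hypothetical stand-in for a ViT: one scalar per patch, linear logits."""
    logits = [sum(w * p for w, p in zip(row, patches)) for row in weights]
    return softmax(logits)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def masking_objective(patches, gates, weights, baseline=0.0, sparsity_weight=0.1):
    """Return (loss, divergence, kept_fraction) for one gate assignment.

    A gate of 1 keeps a patch, a gate of 0 replaces it with the baseline.
    The loss trades off fidelity to the original prediction against the
    fraction of patches kept (an assumed, simplified sparsity penalty).
    """
    masked = [g * x + (1.0 - g) * baseline for g, x in zip(gates, patches)]
    p_full = toy_classifier(patches, weights)
    p_masked = toy_classifier(masked, weights)
    divergence = kl_divergence(p_full, p_masked)
    kept_fraction = sum(gates) / len(gates)
    return divergence + sparsity_weight * kept_fraction, divergence, kept_fraction


# Four "patches"; only the first one separates the two classes.
patches = [2.0, 0.0, -1.0, 0.0]
weights = [[1.0, 0.0, 0.5, 0.0], [-1.0, 0.0, 0.5, 0.0]]

all_open = masking_objective(patches, [1.0, 1.0, 1.0, 1.0], weights)
informative_only = masking_objective(patches, [1.0, 0.0, 0.0, 0.0], weights)
drop_informative = masking_objective(patches, [0.0, 1.0, 1.0, 1.0], weights)
```

In this toy setup, keeping only the first patch leaves the predicted distribution unchanged while keeping a quarter of the input, so it scores lower than keeping everything; masking that patch instead changes the prediction and is penalised through the divergence term.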
Related papers
- RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs [16.185253476874006]
We propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs.
Our approach employs random image transformations as complements to the original probability distribution.
Our empirical results show that while the isolated use of transformed images initially degrades performance, strategic implementation of these transformations can indeed serve as effective complements.
arXiv Detail & Related papers (2024-05-28T04:41:02Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- Interpretability-Aware Vision Transformer [13.310757078491916]
Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
arXiv Detail & Related papers (2023-09-14T21:50:49Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Single-round Self-supervised Distributed Learning using Vision
Transformer [34.76985278888513]
We propose a self-supervised masked sampling distillation method for the vision transformer.
This method can be implemented without continuous communication and can enhance privacy by utilizing a vision transformer-specific encryption technique.
arXiv Detail & Related papers (2023-01-05T13:47:36Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Self-supervised Equivariant Attention Mechanism for Weakly Supervised
Semantic Segmentation [93.83369981759996]
We propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap between weakly and fully supervised segmentation.
Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation.
We propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning.
arXiv Detail & Related papers (2020-04-09T14:57:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.