eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised
Semantic Segmentation
- URL: http://arxiv.org/abs/2207.05358v1
- Date: Tue, 12 Jul 2022 07:43:29 GMT
- Title: eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised
Semantic Segmentation
- Authors: Lu Yu, Wei Xiang, Juan Fang, Yi-Ping Phoebe Chen, Lianhua Chi
- Abstract summary: We propose a novel vision transformer dubbed the eXplainable Vision Transformer (eX-ViT).
eX-ViT is able to jointly discover robust interpretable features and perform predictions.
- Score: 19.311318200149383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision transformer models have become prominent for a range of
vision tasks. These models, however, are usually opaque and offer weak feature
interpretability. Moreover, no existing method provides an intrinsically
interpretable transformer that can explain its reasoning process and offer a
faithful explanation. To close these crucial gaps, we
propose a novel vision transformer dubbed the eXplainable Vision Transformer
(eX-ViT), an intrinsically interpretable transformer model that is able to
jointly discover robust interpretable features and perform predictions.
Specifically, eX-ViT is composed of the Explainable Multi-Head Attention
(E-MHA) module, the Attribute-guided Explainer (AttE) module and the
self-supervised attribute-guided loss. E-MHA produces explainable attention
weights that learn semantically interpretable representations from local
patches with respect to model decisions while remaining robust to noise. Meanwhile,
AttE is proposed to encode discriminative attribute features for the target
object through diverse attribute discovery, which constitutes faithful evidence
for the model's predictions. In addition, a self-supervised attribute-guided
loss is developed for our eX-ViT, which aims at learning enhanced
representations through the attribute discriminability mechanism and attribute
diversity mechanism, to localize diverse and discriminative attributes and
generate more robust explanations. As a result, we can uncover faithful and
robust interpretations with diverse attributes through the proposed eX-ViT.
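As a concrete illustration of the attribute discriminability and diversity mechanisms described above, here is a minimal sketch of one way such an attribute-guided loss could be written. The abstract does not give the exact formulation, so the tensor shapes, names, and individual loss terms below are assumptions rather than the paper's actual loss.

```python
# Minimal sketch of an attribute diversity + discriminability loss.
# NOTE: the exact eX-ViT loss is not specified in the abstract; shapes,
# names, and terms here are illustrative assumptions only.
import torch
import torch.nn.functional as F

def attribute_guided_loss(attr_maps: torch.Tensor) -> torch.Tensor:
    """attr_maps: (batch, num_attributes, num_patches) attention over patches."""
    b, k, n = attr_maps.shape
    maps = F.normalize(attr_maps, dim=-1)          # unit-norm map per attribute

    # Diversity: penalise overlap between different attribute maps, i.e. the
    # off-diagonal entries of the attribute-attribute similarity matrix.
    sim = torch.bmm(maps, maps.transpose(1, 2))    # (b, k, k)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    diversity_loss = (off_diag ** 2).sum(dim=(1, 2)).mean() / max(k * (k - 1), 1)

    # Discriminability: encourage each attribute map to be peaked on a few
    # patches (low entropy) rather than spread uniformly over the image.
    probs = F.softmax(attr_maps, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)   # (b, k)
    discriminability_loss = entropy.mean()

    return diversity_loss + discriminability_loss
```

In a training loop such a term would simply be added, with some weight, to the usual classification loss of the weakly supervised model.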
Related papers
- ViTmiX: Vision Transformer Explainability Augmented by Mixed Visualization Methods [1.1650821883155187]
We present a hybrid approach that mixes multiple explainability techniques to enhance interpretability of ViT models.
Our experiments reveal that this hybrid approach significantly improves the interpretability of ViT models compared to individual methods (a generic sketch of mixing two explanation maps appears after this list).
To quantify the explainability gain, we introduce a novel post-hoc explainability measure based on the pigeonhole principle.
arXiv Detail & Related papers (2024-12-18T18:18:19Z) - The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights [10.777646083061395]
We introduce "concept editing", an innovative variation of knowledge editing that uncovers conceptualisation mechanisms within large language models.
We analyse the Multi-Layer Perceptron (MLP), Multi-Head Attention (MHA), and hidden state components of transformer models.
Our work highlights the complex, layered nature of semantic processing in LLMs and the challenges of isolating and modifying specific concepts within these models.
arXiv Detail & Related papers (2024-08-05T18:50:08Z) - Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers [12.986126243018452]
Transformers are structurally incapable of representing the linear or additive surrogate models used for feature attribution.
We introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework.
We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models.
arXiv Detail & Related papers (2024-05-22T11:14:00Z) - Uncertainty in latent representations of variational autoencoders optimized for visual tasks [3.9504737666460037]
We investigate the properties of inference in Variational Autoencoders (VAEs).
We draw inspiration from classical computer vision to introduce an inductive bias into the VAE.
We find that inference capabilities are restored through the development of a particular motif in the inference network.
arXiv Detail & Related papers (2024-04-23T16:26:29Z) - Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z) - Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Learning Generative Vision Transformer with Energy-Based Latent Space
for Saliency Prediction [51.80191416661064]
We propose a novel vision transformer with latent variables following an informative energy-based prior for salient object detection.
Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation.
With the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image.
arXiv Detail & Related papers (2021-12-27T06:04:33Z) - Generative Counterfactuals for Neural Networks via Attribute-Informed
Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z) - Transformer-based Conditional Variational Autoencoder for Controllable
Story Generation [39.577220559911055]
We investigate large-scale latent variable models (LVMs) for neural story generation with objectives in two threads: generation effectiveness and controllability.
We advocate to revive latent variable modeling, essentially the power of representation learning, in the era of Transformers.
Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE).
arXiv Detail & Related papers (2021-01-04T08:31:11Z)
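The ViTmiX entry above motivates mixing several post-hoc explanation methods. Below is a generic sketch of combining two ViT relevance maps (for example, an attention-rollout map and a gradient-based map); the specific combination rule and weighting are assumptions for illustration and do not reproduce that paper's actual pipeline.

```python
# Generic sketch: blend two explanation maps for the same image.
# The inputs are assumed to be non-negative relevance maps of identical shape;
# the mixing rule below is an illustrative choice, not ViTmiX's method.
import torch

def mix_explanations(rollout_map: torch.Tensor,
                     gradient_map: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Both maps: (H, W) relevance scores; returns a map normalised to [0, 1]."""
    def norm01(m: torch.Tensor) -> torch.Tensor:
        m = m - m.min()
        return m / (m.max() + 1e-8)

    r, g = norm01(rollout_map), norm01(gradient_map)
    # Weighted sum plus an agreement (product) term, which emphasises
    # regions that both explanation methods highlight.
    return norm01(alpha * r + (1.0 - alpha) * r * g)
```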
This list is automatically generated from the titles and abstracts of the papers listed on this site.