VL-InterpreT: An Interactive Visualization Tool for Interpreting
Vision-Language Transformers
- URL: http://arxiv.org/abs/2203.17247v1
- Date: Wed, 30 Mar 2022 05:25:35 GMT
- Title: VL-InterpreT: An Interactive Visualization Tool for Interpreting
Vision-Language Transformers
- Authors: Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan
Duan, Vasudev Lal
- Abstract summary: Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
- Score: 47.581265194864585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Breakthroughs in transformer-based models have revolutionized not only the
NLP field, but also vision and multimodal systems. However, although
visualization and interpretability tools have become available for NLP models,
internal mechanisms of vision and multimodal transformers remain largely
opaque. With the success of these transformers, it is increasingly critical to
understand their inner workings, as unraveling these black boxes will lead to
more capable and trustworthy models. To contribute to this quest, we propose
VL-InterpreT, which provides novel interactive visualizations for interpreting
the attentions and hidden representations in multimodal transformers.
VL-InterpreT is a task-agnostic and integrated tool that (1) tracks a variety
of statistics in attention heads throughout all layers for both vision and
language components, (2) visualizes cross-modal and intra-modal attentions
through easily readable heatmaps, and (3) plots the hidden representations of
vision and language tokens as they pass through the transformer layers. In this
paper, we demonstrate the functionalities of VL-InterpreT through the analysis
of KD-VLP, an end-to-end pretrained vision-language multimodal
transformer-based model, on the tasks of Visual Commonsense Reasoning (VCR)
and WebQA, two visual question answering benchmarks. Furthermore, we present
a few interesting findings about multimodal transformer behavior uncovered
through our tool.
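
VL-InterpreT itself is an interactive GUI tool; the sketch below only illustrates the kind of computations it visualizes (a cross-modal attention heatmap and a layer-wise 2-D projection of token hidden states), assuming a generic vision-language transformer that exposes per-layer attentions and hidden states with text tokens ordered before image tokens. All function and variable names here are hypothetical, not VL-InterpreT's actual API.

    # Illustrative sketch (not VL-InterpreT's API): cross-modal attention
    # heatmap and layer-wise projection of token hidden states.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def cross_modal_heatmap(attentions, layer, head, n_text, n_image):
        # attentions: list over layers of arrays [heads, tokens, tokens],
        # with text tokens first and image tokens after them.
        attn = attentions[layer][head]
        txt2img = attn[:n_text, n_text:n_text + n_image]  # text -> image attention
        plt.imshow(txt2img, aspect="auto", cmap="viridis")
        plt.xlabel("image tokens")
        plt.ylabel("text tokens")
        plt.title(f"text-to-image attention, layer {layer}, head {head}")
        plt.colorbar()
        plt.show()

    def plot_hidden_trajectories(hidden_states, n_text):
        # hidden_states: list over layers of arrays [tokens, dim]; project all
        # token states from all layers into one shared 2-D PCA space.
        stacked = np.concatenate(hidden_states, axis=0)
        coords = PCA(n_components=2).fit_transform(stacked)
        n_tokens = hidden_states[0].shape[0]
        for layer in range(len(hidden_states)):
            pts = coords[layer * n_tokens:(layer + 1) * n_tokens]
            plt.scatter(pts[:n_text, 0], pts[:n_text, 1], c="tab:blue", s=8)
            plt.scatter(pts[n_text:, 0], pts[n_text:, 1], c="tab:orange", s=8)
        plt.title("text (blue) vs. image (orange) token states across layers")
        plt.show()
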
Related papers
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
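
A minimal sketch of the joint query-key embedding idea summarized above, in the spirit of AttentionViz but not its actual implementation: project the query and key vectors of one attention head into a shared 2-D space and scatter-plot them. The projection method (PCA here) and all names are illustrative assumptions.

    # Illustrative joint query-key embedding for one attention head.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def joint_query_key_plot(queries, keys):
        # queries, keys: arrays of shape [tokens, head_dim] from one head.
        joint = np.concatenate([queries, keys], axis=0)
        coords = PCA(n_components=2).fit_transform(joint)
        n = queries.shape[0]
        plt.scatter(coords[:n, 0], coords[:n, 1], c="tab:blue", label="queries", s=10)
        plt.scatter(coords[n:, 0], coords[n:, 1], c="tab:red", label="keys", s=10)
        plt.legend()
        plt.title("joint query-key embedding for one attention head")
        plt.show()
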
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
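
As a rough illustration of the "dynamic linear" formulation mentioned above, the sketch below assumes a B-cos-style unit that scales a unit-norm linear map by the input-weight alignment term |cos(x, w)|^(B-1); the exact formulation is in the cited paper, and this should not be read as its definitive implementation.

    # Rough sketch of a dynamic-linear (B-cos-style) unit; names, defaults,
    # and the exact scaling are assumptions, see the cited paper for details.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BcosLinearSketch(nn.Module):
        def __init__(self, in_features, out_features, b=2.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
            self.b = b

        def forward(self, x):
            w_hat = F.normalize(self.weight, dim=1)      # unit-norm weight rows
            linear = x @ w_hat.t()                       # <w_hat, x>
            cos = linear / (x.norm(dim=-1, keepdim=True) + 1e-6)
            # Dynamic linear: the effective weight depends on the input through
            # the alignment term |cos|^(B-1).
            return linear * cos.abs().pow(self.b - 1)
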
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
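
The parallel convolution branch described above can be sketched, purely for illustration and not as the authors' code, as a transformer block where a convolution runs alongside multi-head self-attention and the fused features feed the FFN; dimensions, the fusion by addition, and all names are assumptions.

    # Illustrative ViTAE-style block: convolution in parallel with self-attention,
    # outputs fused and passed to the feed-forward network.
    import torch
    import torch.nn as nn

    class ParallelConvAttentionBlock(nn.Module):
        def __init__(self, dim=256, heads=8, grid=14):
            super().__init__()
            self.grid = grid  # image tokens form a grid x grid feature map
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
            self.norm = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            # x: [batch, grid*grid, dim] image tokens
            attn_out, _ = self.attn(x, x, x)
            b, n, d = x.shape
            img = x.transpose(1, 2).reshape(b, d, self.grid, self.grid)
            conv_out = self.conv(img).flatten(2).transpose(1, 2)
            fused = self.norm(x + attn_out + conv_out)  # fuse both branches
            return fused + self.ffn(fused)
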
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)