Fugu-MT Paper Translation (Abstract): VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Paper overview: VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

arxiv url: http://arxiv.org/abs/2203.17247v1
Date: Wed, 30 Mar 2022 05:25:35 GMT
Status: Translation complete
Last updated in system: 2022-04-02 14:22:02.234832
Title: VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Title (reference translation): VL-InterpreT: An interactive visualization tool for vision-language transformers
Authors: Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal
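Abstract summary: The internal mechanisms of vision and multimodal transformers remain largely opaque. Given the success of these transformers, understanding their inner workings is increasingly important. We propose VL-InterpreT, which provides interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
Reference score (independently computed attention score): 47.581265194864585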
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic and integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretraining vision-language multimodal transformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. We also present a few interesting findings about multimodal transformer behaviors that were learned through our tool.
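Abstract (reference translation): Breakthroughs in transformer models have revolutionized not only the NLP field but also vision and multimodal systems. However, while visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. Given the success of these transformers, understanding their inner workings is increasingly important, because opening up these black boxes leads to more capable and trustworthy models. To contribute to this effort, we propose VL-InterpreT, which provides interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of attention statistics across all layers for both the vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT by analyzing KD-VLP, a vision-language multimodal transformer model, on two visual question answering benchmarks: Visual Commonsense Reasoning (VCR) and WebQA. We also present several interesting findings about the behavior of multimodal transformers obtained with the tool.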


Related papers