Fugu-MT Paper Translation (Abstract): SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

Paper summary: SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

arxiv url: http://arxiv.org/abs/2001.06927v2
Date: Tue, 16 Jun 2020 17:54:16 GMT
Status: Translation complete
Last updated in system: 2023-01-08 04:58:06.374861
Title: SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
Title (reference translation): SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
Authors: Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar
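Abstract summary: We show that current VQA models perform comparably on perception and reasoning questions but suffer from consistency problems. To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT). We show that SQuINT improves model consistency by ~5%, marginally improves performance on reasoning questions in VQA, and yields better attention maps.
Reference score (independently computed attention measure): 66.86887670416193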
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks: tasks that can only be answered through a synthesis of perception and knowledge about the world, logic and/or reasoning. Analyzing performance across this distinction allows us to notice when existing VQA models have consistency issues; they answer the reasoning questions correctly but fail on associated low-level perception questions. For example, in Figure 1, models answer the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fail on the associated perception question "Are the bananas mostly green or yellow?", indicating that the model likely answered the reasoning question correctly but for the wrong reason. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset which consists of 238K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split. Our evaluation shows that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems. To address this shortcoming, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-question. We show that SQuINT improves model consistency by ~5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
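Abstract (reference translation): Existing VQA datasets contain questions of varying complexity. Most questions in these datasets require perception to recognize the existence, properties, and spatial relationships of entities, but a significant portion pose reasoning tasks, which can be answered only by combining perception with knowledge about the world, logic, and/or reasoning. Analyzing performance across this distinction reveals that existing VQA models have consistency issues: they answer reasoning questions correctly but fail on the associated low-level perception questions. For example, in Figure 1, models correctly answer the complex reasoning question "Is the banana ripe enough to eat?" but fail on the associated perception question "Are the bananas mostly green or yellow?", which suggests the reasoning question was answered correctly for the wrong reason. We quantify how often this occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset of 238K perception questions that serve as sub-questions for the complex reasoning questions in the Reasoning split. Our evaluation shows that state-of-the-art VQA models perform comparably on perception and reasoning questions but suffer from consistency problems. To address this shortcoming, we propose Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-question. We show that SQuINT improves model consistency by about 5%, marginally improves performance on the Reasoning questions in VQA, and produces better attention maps.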
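The two ideas in the abstract, measuring consistency between a reasoning question and its perception sub-questions, and encouraging similar attention for both, can be illustrated with a minimal Python sketch. The abstract does not give the exact metric or loss, so everything here is an assumption made for illustration: the record layout (`reasoning_correct`, `sub_correct`), the choice to count a reasoning question as consistent only when all of its sub-questions are answered correctly, and the use of a mean squared difference between attention weights.

```python
def consistency(records):
    """Share of correctly answered reasoning questions whose perception
    sub-questions are all answered correctly (assumed definition)."""
    correct = [r for r in records if r["reasoning_correct"]]
    if not correct:
        return 0.0
    # Note: a record with no sub-questions counts as consistent here.
    consistent = sum(1 for r in correct if all(r["sub_correct"]))
    return consistent / len(correct)


def attention_alignment_loss(att_reasoning, att_sub):
    """Mean squared difference between two attention distributions over
    the same image regions (hypothetical stand-in for the tuning signal)."""
    if len(att_reasoning) != len(att_sub):
        raise ValueError("attention vectors must cover the same regions")
    squared = [(a - b) ** 2 for a, b in zip(att_reasoning, att_sub)]
    return sum(squared) / len(squared)


# Hypothetical evaluation records, not data from the paper.
records = [
    {"reasoning_correct": True, "sub_correct": [True, True]},
    {"reasoning_correct": True, "sub_correct": [True, False]},
    {"reasoning_correct": False, "sub_correct": [True]},
]
print(consistency(records))
print(attention_alignment_loss([0.7, 0.3], [0.6, 0.4]))
```

Under these assumptions, a lower alignment loss means the model looks at more similar image regions for the reasoning question and its sub-question, which is the behavior the abstract says SQuINT encourages.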

Related papers