Guiding Visual Question Answering with Attention Priors
- URL: http://arxiv.org/abs/2205.12616v1
- Date: Wed, 25 May 2022 09:53:47 GMT
- Title: Guiding Visual Question Answering with Attention Priors
- Authors: Thao Minh Le, Vuong Le, Sunil Gupta, Svetha Venkatesh, Truyen Tran
- Abstract summary: We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
- Score: 76.21671164766073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current success of modern visual reasoning systems is arguably attributed
to cross-modality attention mechanisms. However, in deliberative reasoning such
as in VQA, attention is unconstrained at each step, and thus may serve as a
statistical pooling mechanism rather than a semantic operation intended to
select information relevant to inference. This is because at training time,
attention is only guided by a very sparse signal (i.e. the answer label) at the
end of the inference chain. This causes the cross-modality attention weights to
deviate from the desired visual-language bindings. To rectify this deviation,
we propose to guide the attention mechanism using explicit linguistic-visual
grounding. This grounding is derived by connecting structured linguistic
concepts in the query to their referents among the visual objects. Here we
learn the grounding from the pairing of questions and images alone, without the
need for answer annotation or external grounding supervision. This grounding
guides the attention mechanism inside VQA models through a duality of
mechanisms: pre-training attention weight calculation and directly guiding the
weights at inference time on a case-by-case basis. The resultant algorithm is
capable of probing attention-based reasoning models, injecting relevant
associative knowledge, and regulating the core reasoning process. This scalable
enhancement improves the performance of VQA models, fortifies their robustness
to limited access to supervised data, and increases interpretability.
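As a rough illustration of the duality of mechanisms described above, here is a hedged sketch (not the authors' implementation): the learned cross-modal attention is blended with a word-to-object grounding prior at inference time, and a KL term nudges the attention weights toward that prior during pre-training. The tensor shapes, the blending weight `alpha`, and the specific KL form are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of attention guided by a
# linguistic-visual grounding prior. `grounding_prior` and `alpha` are
# illustrative assumptions.
import torch
import torch.nn.functional as F


def guided_cross_attention(query_words, visual_objects, grounding_prior, alpha=0.5):
    """Cross-modal attention whose weights are nudged toward a grounding prior.

    query_words:     (n_words, d)          linguistic concept embeddings
    visual_objects:  (n_objects, d)        visual object embeddings
    grounding_prior: (n_words, n_objects)  word-to-object grounding scores,
                     rows summing to 1, learned from question-image pairs only
    alpha:           interpolation weight between learned attention and prior
    """
    d = query_words.size(-1)
    # Standard scaled dot-product attention from words to objects.
    logits = query_words @ visual_objects.t() / d ** 0.5
    attn = F.softmax(logits, dim=-1)                       # (n_words, n_objects)

    # Inference-time guidance: blend the learned attention with the prior.
    guided = (1.0 - alpha) * attn + alpha * grounding_prior
    guided = guided / guided.sum(dim=-1, keepdim=True)     # renormalize for safety

    attended = guided @ visual_objects                     # (n_words, d)
    return attended, guided


def attention_prior_loss(attn, grounding_prior, eps=1e-8):
    """Pre-training signal: KL(prior || attention), pulling the model's attention
    toward the linguistic-visual grounding before any answer supervision."""
    kl = grounding_prior * (torch.log(grounding_prior + eps) - torch.log(attn + eps))
    return kl.sum(dim=-1).mean()
```

In the paper's setting the grounding prior itself is learned from question-image pairs alone; here it is simply taken as a given row-stochastic matrix.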
Related papers
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable by design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system can improve by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- Revisiting Attention Weights as Explanations from an Information Theoretic Perspective [4.499369811647602]
We show that attention mechanisms have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z)
- Attention in Reasoning: Dataset, Analysis, and Modeling [31.3104693230952]
We propose an Attention with Reasoning capability (AiR) framework that uses attention to understand and improve the process leading to task outcomes.
We first define an evaluation metric based on a sequence of atomic reasoning operations, enabling a quantitative measurement of attention.
We then collect human eye-tracking and answer correctness data, and analyze various machine and human attention mechanisms on their reasoning capability.
arXiv Detail & Related papers (2022-04-20T20:32:31Z)
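For the AiR entry above, a metric over atomic reasoning operations can be pictured roughly as follows. This is a simplified stand-in, not the paper's actual metric: it assumes per-step attention maps over objects and per-step sets of relevant objects are given, and simply averages the attention mass placed on the relevant objects at each reasoning step.

```python
# Simplified stand-in (not the AiR paper's exact metric): score attention against
# a sequence of atomic reasoning operations by averaging, over the steps, the
# attention mass placed on the objects relevant to each step.
import numpy as np


def stepwise_attention_score(attention_per_step, relevant_objects_per_step):
    """attention_per_step:        list of (n_objects,) arrays, each summing to 1
       relevant_objects_per_step: list of index arrays of the objects a correct
                                  reasoner should attend to at that step"""
    per_step = [float(np.sum(attn[idx]))
                for attn, idx in zip(attention_per_step, relevant_objects_per_step)]
    return float(np.mean(per_step))


# Example: a two-step question ("find the cup" -> "check its color").
attn_steps = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
relevant_steps = [np.array([0]), np.array([1])]
print(stepwise_attention_score(attn_steps, relevant_steps))  # 0.75
```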
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention, which explicitly encourages self-attention to match the distributions of the key and query within each head.
Any model with self-attention, including pre-trained ones, can easily be converted to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
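For the alignment attention entry above, one way to picture "matching the key and query distributions within a head" is a regularizer added to the training loss. The sketch below is a crude stand-in for the paper's objective, matching only the first and second moments of the query and key projections; the paper's actual distribution-matching formulation is not reproduced here.

```python
# Crude stand-in (an assumption, not the paper's formulation): penalize the gap
# between the empirical distributions of a head's query and key projections by
# matching their first and second moments.
import torch


def key_query_alignment_loss(queries: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """queries, keys: (n_tokens, d_head) projections of a single attention head."""
    mean_gap = (queries.mean(dim=0) - keys.mean(dim=0)).pow(2).sum()
    var_gap = (queries.var(dim=0) - keys.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap
```

Such a penalty would be added to the task loss per head with a small weight, so any self-attention model, pre-trained or not, could adopt it without architectural changes.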
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can be also applied in guidance of SparseBERT design.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z)
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
- Why Attentions May Not Be Interpretable? [46.69116768203185]
Recent research has found that attention-as-importance interpretations often do not work as expected.
We show that one root cause of this phenomenon is shortcuts, meaning that the attention weights themselves may carry extra information.
We propose two methods to mitigate this issue.
arXiv Detail & Related papers (2020-06-10T05:08:30Z)