Knowledge-Based Counterfactual Queries for Visual Question Answering
- URL: http://arxiv.org/abs/2303.02601v1
- Date: Sun, 5 Mar 2023 08:00:30 GMT
- Title: Knowledge-Based Counterfactual Queries for Visual Question Answering
- Authors: Theodoti Stoikou, Maria Lymperaiou, Giorgos Stamou
- Abstract summary: We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual Question Answering (VQA) is a popular task that combines
vision and language, with numerous implementations in the literature. Even
though some works address explainability and robustness issues in VQA models,
very few employ counterfactuals as a means of probing such challenges in a
model-agnostic way. In this work, we propose a systematic method for explaining
the behavior and investigating the robustness of VQA models through
counterfactual perturbations. To this end, we exploit structured knowledge
bases to perform deterministic, optimal, and controllable word-level
replacements targeting the linguistic modality, and we then evaluate the
model's response against such counterfactual inputs. Finally, we qualitatively
extract local and global explanations based on counterfactual responses, which
ultimately prove insightful for interpreting VQA model behaviors. By performing
a variety of perturbation types that target different parts of speech of the
input question, we gain insight into the model's reasoning by comparing its
responses under different adversarial conditions. Overall, our analysis reveals
possible biases in the model's decision-making process, as well as expected and
unexpected patterns that impact its performance both quantitatively and
qualitatively.
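As an illustration of the perturbation pipeline described above, the following is a minimal sketch of a knowledge-based word-level counterfactual probe. It assumes WordNet (accessed through NLTK) as the structured knowledge base and a hypothetical `vqa_model(image, question)` callable standing in for an arbitrary VQA model; the paper's actual knowledge bases and replacement criteria may differ.
```python
# Minimal sketch of knowledge-based counterfactual probing for VQA.
# Assumptions: WordNet (via NLTK) as the structured knowledge base and a
# hypothetical `vqa_model(image, question)` callable; requires a prior
# nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def cohyponym_replacements(word: str, max_candidates: int = 5) -> list[str]:
    """Collect sibling concepts (co-hyponyms) of `word` from WordNet,
    serving as deterministic, controllable word-level replacements."""
    candidates: list[str] = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():
                if sibling == synset:
                    continue  # skip the original concept itself
                name = sibling.lemmas()[0].name().replace("_", " ")
                if name != word and name not in candidates:
                    candidates.append(name)
    return candidates[:max_candidates]

def probe_vqa_model(vqa_model, image, question: str, target_word: str):
    """Compare the model's answer on the original question with its answers
    on counterfactually perturbed variants of that question."""
    original_answer = vqa_model(image, question)
    perturbed_answers = []
    for replacement in cohyponym_replacements(target_word):
        perturbed = question.replace(target_word, replacement)
        perturbed_answers.append((perturbed, vqa_model(image, perturbed)))
    return original_answer, perturbed_answers
```
For a question such as "What color is the dog?" with target word "dog", the probe swaps in WordNet siblings of the concept and records whether the model's answer stays stable, which is the kind of response comparison that local and global explanations can be built on.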
Related papers
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
We introduce an interpretable by design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions
This work proposes a new method that utilizes semantically related questions, referred to as basic questions, acting as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
- Logical Implications for Visual Question Answering Consistency
We introduce a new consistency loss term that can be used by a wide range of VQA models.
We propose to infer these logical relations using a dedicated language model and to use them in our proposed consistency loss function.
We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models.
arXiv Detail & Related papers (2023-03-16T16:00:18Z)
- COIN: Counterfactual Image Generation for VQA Interpretation
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the accompanying discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
- Latent Variable Models for Visual Question Answering
We propose latent variable models for Visual Question Answering.
Extra information (e.g., captions and answer categories) is incorporated as latent variables to improve inference.
Experiments on the VQA v2.0 benchmark dataset demonstrate the effectiveness of our proposed models.
arXiv Detail & Related papers (2021-01-16T08:21:43Z)
- Learning from Lexical Perturbations for Consistent Visual Question Answering
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach based on modular networks to address this issue, creating pairs of questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z)