VisQA: X-raying Vision and Language Reasoning in Transformers
- URL: http://arxiv.org/abs/2104.00926v1
- Date: Fri, 2 Apr 2021 08:08:25 GMT
- Title: VisQA: X-raying Vision and Language Reasoning in Transformers
- Authors: Theo Jaunet, Corentin Kervadec, Romain Vuillemot, Grigory Antipov,
Moez Baccouche and Christian Wolf
- Abstract summary: Recent research has shown that state-of-the-art models tend to produce answers exploiting biases and shortcuts in the training data.
We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation.
- Score: 10.439369423744708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering systems target answering open-ended textual
questions given input images. They are a testbed for learning high-level
reasoning with a primary use in HCI, for instance assistance for the visually
impaired. Recent research has shown that state-of-the-art models tend to
produce answers exploiting biases and shortcuts in the training data, and
sometimes do not even look at the input image rather than performing the
required reasoning steps. We present VisQA, a visual analytics tool that
explores this question of reasoning vs. bias exploitation. It exposes the key
element of state-of-the-art neural models: attention maps in transformers.
Our working hypothesis is that reasoning steps leading to model predictions are
observable from attention distributions, which are particularly useful for
visualization. The design process of VisQA was motivated by well-known bias
examples from the fields of deep learning and vision-language reasoning and
evaluated in two ways. First, as a result of a collaboration across three
fields (machine learning, vision and language reasoning, and data analytics),
the work led to a direct impact on the design and training of a neural model
for VQA, improving its performance as a consequence. Second, we also report on the
design of VisQA, and a goal-oriented evaluation of VisQA targeting the analysis
of a model's decision process by multiple experts, providing evidence that it
makes the inner workings of models accessible to users.
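To make the working hypothesis above concrete, here is a minimal sketch of the quantity such a tool visualizes: per-head attention distributions of question tokens over image-region features in a transformer-style VQA model. This is not the VisQA code; the function name, the embedding width of 768, and the 36-region / 12-head setup are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def attention_maps(queries, keys, num_heads):
    """Per-head attention distributions, shape (num_heads, n_q, n_k)."""
    n_q, d = queries.shape
    n_k, _ = keys.shape
    d_head = d // num_heads
    q = queries.view(n_q, num_heads, d_head).transpose(0, 1)  # (H, n_q, d_head)
    k = keys.view(n_k, num_heads, d_head).transpose(0, 1)     # (H, n_k, d_head)
    scores = q @ k.transpose(1, 2) / d_head ** 0.5            # (H, n_q, n_k)
    return F.softmax(scores, dim=-1)                          # each row sums to 1

# Toy example: 6 question tokens attending over 36 detected image regions.
torch.manual_seed(0)
tokens  = torch.randn(6, 768)   # language-stream embeddings (assumed width)
regions = torch.randn(36, 768)  # object-detector region features (assumed width)
maps = attention_maps(tokens, regions, num_heads=12)
print(maps.shape)               # torch.Size([12, 6, 36])
# Each (head, token) row is a distribution over image regions: the kind of
# attention map a tool like VisQA renders to inspect reasoning vs. bias.
```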
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
arXiv Detail & Related papers (2023-03-05T08:00:30Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - Visual Perturbation-aware Collaborative Learning for Overcoming the
Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z) - COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z) - AdViCE: Aggregated Visual Counterfactual Explanations for Machine
Learning Model Validation [9.996986104171754]
We introduce AdViCE, a visual analytics tool that aims to guide users in black-box model debugging and validation.
The solution rests on two main visual user interface innovations: (1) an interactive visualization that enables the comparison of decisions on user-defined data subsets; (2) an algorithm and visual design to compute and visualize counterfactual explanations.
arXiv Detail & Related papers (2021-09-12T22:52:12Z) - How Transferable are Reasoning Patterns in VQA? [10.439369423744708]
We argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision and language problems.
We train a visual oracle and in a large scale study provide experimental evidence that it is much less prone to exploiting spurious dataset biases.
We exploit these insights by transferring reasoning patterns from the oracle, via fine-tuning, to a SOTA Transformer-based VQA model that takes standard noisy visual inputs.
arXiv Detail & Related papers (2021-04-08T10:18:45Z) - Loss re-scaling VQA: Revisiting the Language Prior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Component Analysis for Visual Question Answering Architectures [10.56011196733086]
The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in Visual Question Answering models.
Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.
arXiv Detail & Related papers (2020-02-12T17:25:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.