VisQA: X-raying Vision and Language Reasoning in Transformers
- URL: http://arxiv.org/abs/2104.00926v1
- Date: Fri, 2 Apr 2021 08:08:25 GMT
- Title: VisQA: X-raying Vision and Language Reasoning in Transformers
- Authors: Theo Jaunet, Corentin Kervadec, Romain Vuillemot, Grigory Antipov,
Moez Baccouche and Christian Wolf
- Abstract summary: Recent research has shown that state-of-the-art models tend to produce answers exploiting biases and shortcuts in the training data.
We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation.
- Score: 10.439369423744708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering systems target answering open-ended textual
questions given input images. They are a testbed for learning high-level
reasoning with a primary use in HCI, for instance assistance for the visually
impaired. Recent research has shown that state-of-the-art models tend to
produce answers exploiting biases and shortcuts in the training data, and
sometimes do not even look at the input image rather than performing the
required reasoning steps. We present VisQA, a visual analytics tool that
explores this question of reasoning vs. bias exploitation. It exposes the key
element of state-of-the-art neural models: attention maps in transformers.
Our working hypothesis is that reasoning steps leading to model predictions are
observable from attention distributions, which are particularly useful for
visualization. The design process of VisQA was motivated by well-known bias
examples from the fields of deep learning and vision-language reasoning and
evaluated in two ways. First, as a result of a collaboration across three
fields (machine learning, vision and language reasoning, and data analytics),
the work led to a direct impact on the design and training of a neural model
for VQA, improving its performance as a consequence. Second, we also report on the
design of VisQA, and a goal-oriented evaluation of VisQA targeting the analysis
of a model's decision process by multiple experts, providing evidence that it
makes the inner workings of models accessible to users.
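To make the working hypothesis above concrete, here is a minimal sketch of the quantity such a tool visualizes: per-head attention distributions of question tokens over image-region features in a transformer-style VQA model. This is not the VisQA code; the function name, the embedding width of 768, and the 36-region / 12-head setup are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def attention_maps(queries, keys, num_heads):
    """Per-head attention distributions, shape (num_heads, n_q, n_k)."""
    n_q, d = queries.shape
    n_k, _ = keys.shape
    d_head = d // num_heads
    q = queries.view(n_q, num_heads, d_head).transpose(0, 1)  # (H, n_q, d_head)
    k = keys.view(n_k, num_heads, d_head).transpose(0, 1)     # (H, n_k, d_head)
    scores = q @ k.transpose(1, 2) / d_head ** 0.5            # (H, n_q, n_k)
    return F.softmax(scores, dim=-1)                          # each row sums to 1

# Toy example: 6 question tokens attending over 36 detected image regions.
torch.manual_seed(0)
tokens  = torch.randn(6, 768)   # language-stream embeddings (assumed width)
regions = torch.randn(36, 768)  # object-detector region features (assumed width)
maps = attention_maps(tokens, regions, num_heads=12)
print(maps.shape)               # torch.Size([12, 6, 36])
# Each (head, token) row is a distribution over image regions: the kind of
# attention map a tool like VisQA renders to inspect reasoning vs. bias.
```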
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
arXiv Detail & Related papers (2023-03-05T08:00:30Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - Visual Perturbation-aware Collaborative Learning for Overcoming the
Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
The experimental results on two diagnostic VQA-CP benchmark datasets evidently demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z) - COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images.
In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z) - AdViCE: Aggregated Visual Counterfactual Explanations for Machine
Learning Model Validation [9.996986104171754]
We introduce AdViCE, a visual analytics tool that aims to guide users in black-box model debugging and validation.
The solution rests on two main visual user interface innovations: (1) an interactive visualization that enables the comparison of decisions on user-defined data subsets; (2) an algorithm and visual design to compute and visualize counterfactual explanations.
arXiv Detail & Related papers (2021-09-12T22:52:12Z) - How Transferable are Reasoning Patterns in VQA? [10.439369423744708]
We argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision and language problems.
We train a visual oracle and in a large scale study provide experimental evidence that it is much less prone to exploiting spurious dataset biases.
We exploit these insights by transferring reasoning patterns from the oracle, via fine-tuning, to a SOTA Transformer-based VQA model that takes standard noisy visual inputs.
arXiv Detail & Related papers (2021-04-08T10:18:45Z) - Loss re-scaling VQA: Revisiting the Language Prior Problem from a
Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Component Analysis for Visual Question Answering Architectures [10.56011196733086]
The main goal of this paper is to provide a comprehensive analysis regarding the impact of each component in Visual Question Answering models.
Our major contribution is to identify core components for training VQA models so as to maximize their predictive performance.
arXiv Detail & Related papers (2020-02-12T17:25:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.