ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering
- URL: http://arxiv.org/abs/2509.16680v1
- Date: Sat, 20 Sep 2025 13:12:08 GMT
- Title: ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering
- Authors: Xingjian Diao, Weiyi Wu, Keyi Kong, Peijun Qing, Xinwen Xu, Ming Cheng, Soroush Vosoughi, Jiang Gui,
- Abstract summary: Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions.<n>We present ProtoVQA, a unified framework that learns question-aware prototypes that serve as reasoning anchors.<n>We propose the Visual-Linguistic Alignment Score (VLAS), which measures how well the model's attended regions align with ground-truth evidence.
- Score: 30.65960340061612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) is increasingly used in diverse applications ranging from general visual reasoning to safety-critical domains such as medical imaging and autonomous systems, where models must provide not only accurate answers but also explanations that humans can easily understand and verify. Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions for purely visual reasoning tasks, yet remains underexplored in the context of VQA. We present ProtoVQA, a unified prototypical framework that (i) learns question-aware prototypes that serve as reasoning anchors, connecting answers to discriminative image regions, (ii) applies spatially constrained matching to ensure that the selected evidence is coherent and semantically relevant, and (iii) supports both answering and grounding tasks through a shared prototype backbone. To assess explanation quality, we propose the Visual-Linguistic Alignment Score (VLAS), which measures how well the model's attended regions align with ground-truth evidence. Experiments on Visual7W show that ProtoVQA yields faithful, fine-grained explanations while maintaining competitive accuracy, advancing the development of transparent and trustworthy VQA systems.
Related papers
- Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? [61.533560295383786]
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture.<n>We observe that U-MLLMs fail to maintain semantic equivalence when required to render the same results in the image modality.<n>We introduce VGUBench, a framework to decouple reasoning logic from generation fidelity.
arXiv Detail & Related papers (2026-02-27T06:23:56Z) - Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning [55.232400251303794]
Look As You Think (LAT) is a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution.<n>LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5.
arXiv Detail & Related papers (2025-11-15T02:50:23Z) - Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts [39.92376420375139]
We propose the first framework, named Interpretable Visual Prompt Tuning, to explore interpretability for visual prompts.<n>Visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes.<n>IVPT aggregates features from these regions to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities.
arXiv Detail & Related papers (2025-03-08T06:12:50Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.<n>We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.<n>We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - Multimodal Rationales for Explainable Visual Question Answering [12.893224628061516]
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image.<n>We propose a novel model termed MRVQA, which provides visual and textual rationales to support its predicted answers.<n>MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the model.
arXiv Detail & Related papers (2024-02-06T11:07:05Z) - Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable by design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system can improve 4.64% over a comparable black-box system in reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z) - Visually Grounded VQA by Lattice-based Retrieval [24.298908211088072]
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions.
In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task.
Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question.
arXiv Detail & Related papers (2022-11-15T12:12:08Z) - ADVISE: ADaptive Feature Relevance and VISual Explanations for
Convolutional Neural Networks [0.745554610293091]
We introduce ADVISE, a new explainability method that quantifies and leverages the relevance of each unit of the feature map to provide better visual explanations.
We extensively evaluate our idea in the image classification task using AlexNet, VGG16, ResNet50, and Xception pretrained on ImageNet.
Our experiments further show that ADVISE fulfils the sensitivity and implementation independence axioms while passing the sanity checks.
arXiv Detail & Related papers (2022-03-02T18:16:57Z) - Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to effectively jointly learn these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z) - MUTANT: A Training Paradigm for Out-of-Distribution Generalization in
Visual Question Answering [58.30291671877342]
We present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input.
MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a $10.57%$ improvement.
arXiv Detail & Related papers (2020-09-18T00:22:54Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.