Checkmate: interpretable and explainable RSVQA is the endgame
- URL: http://arxiv.org/abs/2508.13086v1
- Date: Mon, 18 Aug 2025 16:59:43 GMT
- Title: Checkmate: interpretable and explainable RSVQA is the endgame
- Authors: Lucrezia Tosato, Christel Tartini Chappuis, Syrielle Montariol, Flora Weissgerber, Sylvain Lobry, Devis Tuia
- Abstract summary: We introduce a novel RSVQA dataset, Chessboard, designed to minimize biases through 3'123'253 questions and a balanced answer distribution. Each answer is linked to one or more cells within the image, enabling fine-grained visual reasoning. We develop an explainable and interpretable model called Checkmate that identifies the image cells most relevant to its decisions.
- Score: 5.445304535169411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote Sensing Visual Question Answering (RSVQA) presents unique challenges in ensuring that model decisions are both understandable and grounded in visual content. Current models often suffer from a lack of interpretability and explainability, as well as from biases in dataset distributions that lead to shortcut learning. In this work, we tackle these issues by introducing a novel RSVQA dataset, Chessboard, designed to minimize biases through 3'123'253 questions and a balanced answer distribution. Each answer is linked to one or more cells within the image, enabling fine-grained visual reasoning. Building on this dataset, we develop an explainable and interpretable model called Checkmate that identifies the image cells most relevant to its decisions. Through extensive experiments across multiple model architectures, we show that our approach improves transparency and supports more trustworthy decision-making in RSVQA systems.
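As a rough illustration of the cell-level grounding described in the abstract, the sketch below shows one way an RSVQA model could score grid cells against a question and expose those scores as its explanation. This is a hypothetical stand-in, not the authors' Checkmate architecture; the module names, dimensions, and attention-style pooling are all assumptions.

```python
# Hypothetical sketch of cell-level answer grounding for RSVQA (not the
# authors' Checkmate architecture): the image is split into grid cells,
# each cell is scored against the question, and the per-cell weights are
# returned alongside the answer as the model's explanation.
import torch
import torch.nn as nn


class GridRelevanceVQA(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_answers=100):
        super().__init__()
        self.cell_proj = nn.Linear(img_dim, hidden)   # per-cell image features
        self.q_proj = nn.Linear(txt_dim, hidden)      # question embedding
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, cell_feats, q_feat):
        # cell_feats: (B, n_cells, img_dim); q_feat: (B, txt_dim)
        cells = self.cell_proj(cell_feats)            # (B, n_cells, hidden)
        query = self.q_proj(q_feat).unsqueeze(1)      # (B, 1, hidden)
        scores = (cells * query).sum(-1)              # (B, n_cells) relevance
        weights = scores.softmax(dim=-1)
        pooled = (weights.unsqueeze(-1) * cells).sum(1)
        return self.classifier(pooled), weights       # answer logits + cell map
```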
Related papers
- Multimodal Rationales for Explainable Visual Question Answering [12.893224628061516]
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. We propose a novel model termed MRVQA, which provides visual and textual rationales to support its predicted answers. MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the model.
arXiv Detail & Related papers (2024-02-06T11:07:05Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
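A toy sketch of the perturbation idea above (an assumption about the general recipe, not the UNK-VQA construction): an answerable image-question pair is made likely unanswerable either by swapping in an unrelated image or by substituting a word in the question.

```python
# Toy illustration of perturbation-based augmentation for abstention probing
# (an assumption, not the UNK-VQA construction): perturb either the image
# side or the question side of an answerable pair.
import random


def perturb_pair(question, image_id, all_image_ids, unrelated_nouns):
    if random.random() < 0.5:
        # image-side perturbation: pair the question with a different image
        new_image = random.choice([i for i in all_image_ids if i != image_id])
        return question, new_image
    # question-side perturbation: replace one word with an unrelated noun
    words = question.split()
    words[random.randrange(len(words))] = random.choice(unrelated_nouns)
    return " ".join(words), image_id
```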
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [19.99615698375829]
We propose a contrastive learning strategy for training robust RSVQA models against diverse question templates and words.
Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model.
arXiv Detail & Related papers (2023-04-07T21:06:58Z)
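The contrastive strategy summarized above could look roughly like the sketch below, which pulls embeddings of a question and its paraphrased or translated variant together with an InfoNCE-style loss; the function name, temperature, and batch-wise negatives are assumptions, not the paper's exact formulation.

```python
# Rough sketch (not the paper's exact method) of a contrastive objective that
# pulls embeddings of paraphrased question variants together, so the RSVQA
# model becomes less sensitive to question templates and wording.
import torch
import torch.nn.functional as F


def paraphrase_contrastive_loss(q_emb, q_aug_emb, temperature=0.07):
    """q_emb, q_aug_emb: (B, D) embeddings of a question and its paraphrase."""
    q = F.normalize(q_emb, dim=-1)
    q_aug = F.normalize(q_aug_emb, dim=-1)
    logits = q @ q_aug.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # each question should match its own paraphrase, not other questions
    return F.cross_entropy(logits, targets)
```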
- Barlow constrained optimization for Visual Question Answering [105.3372546782068]
We propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB).
Our model also aligns the joint space with the answer embedding space, where we consider the answer and the image+question pair as two different views of what is, in essence, the same semantic information.
When built on the state-of-the-art GGE model, it improves VQA accuracy by 1.4% and 4% on the VQA-CP v2 and VQA v2 datasets, respectively.
arXiv Detail & Related papers (2022-03-07T21:27:40Z)
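A minimal sketch of a Barlow-Twins-style regularizer in the spirit of COB, treating the fused image+question embedding and the answer embedding as two views whose cross-correlation matrix is pushed toward the identity. The normalization and weighting below are assumptions, not the authors' implementation.

```python
# Illustrative Barlow-Twins-style regularizer (a sketch, not the authors' COB
# code): the joint image+question embedding and the answer embedding are two
# views whose cross-correlation matrix should approach the identity.
import torch


def barlow_regularizer(joint_emb, answer_emb, off_diag_weight=5e-3):
    # joint_emb, answer_emb: (B, D) embeddings of the two views
    b, d = joint_emb.shape
    z1 = (joint_emb - joint_emb.mean(0)) / (joint_emb.std(0) + 1e-6)
    z2 = (answer_emb - answer_emb.mean(0)) / (answer_emb.std(0) + 1e-6)
    c = (z1.t() @ z2) / b                               # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # diagonal -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # rest -> 0
    return on_diag + off_diag_weight * off_diag
```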
- Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering [35.855792706139525]
Multimodal IR, spanning text corpora, knowledge graphs, and images, has attracted much recent interest.
A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information.
We build a new dataset and challenge around a key structural idiom in OKVQA, viz., S3.
arXiv Detail & Related papers (2021-03-09T17:19:50Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
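Under the class-imbalance reading above, loss re-scaling can be illustrated with a simple inverse-frequency weighting of the answer classes; the weighting scheme below is an assumption for illustration, not the paper's exact re-scaling.

```python
# Minimal sketch of frequency-based loss re-scaling for VQA answers under the
# class-imbalance view of the language-prior problem (the inverse-frequency
# weights are an illustrative choice, not the paper's exact scheme).
import torch
import torch.nn.functional as F


def rescaled_vqa_loss(logits, targets, answer_counts):
    # answer_counts: (n_answers,) how often each answer appears in training
    weights = 1.0 / answer_counts.float().clamp(min=1)
    weights = weights / weights.sum() * len(answer_counts)  # normalize to mean 1
    return F.cross_entropy(logits, targets, weight=weights)
```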
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.