VisionTrap: Unanswerable Questions On Visual Data
- URL: http://arxiv.org/abs/2507.17262v1
- Date: Wed, 23 Jul 2025 07:00:19 GMT
- Title: VisionTrap: Unanswerable Questions On Visual Data
- Authors: Asir Saadat, Syem Aziz, Shahriar Mahmud, Abdullah Ibne Masud Mahi, Sabbir Ahmed,
- Abstract summary: This research investigates VQA performance when models are presented with unrealistically generated images or asked unanswerable questions. We introduce a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer even when they should abstain.
- Score: 1.0485739694839669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how vision-language models (VLMs) respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance when models are presented with unrealistically generated images or asked unanswerable questions, assessing whether they recognize the limits of their knowledge or attempt to generate incorrect answers. We introduce a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer even when they should abstain.
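The abstract describes an evaluation of whether a model abstains on inherently unanswerable questions. The sketch below illustrates one way such an abstention-rate check could be scored; it is not the authors' published harness. The `query_vlm` callable, the record keys, the category names, and the keyword-based refusal detector are all assumptions introduced for illustration.

```python
# Minimal sketch of an abstention-rate evaluation for unanswerable VQA items.
# Hypothetical: the record layout, query_vlm(), and the keyword heuristic are
# illustrative assumptions, not the paper's evaluation code.
from typing import Callable, Dict, Iterable

# Phrases commonly treated as a refusal/abstention signal (heuristic only).
ABSTAIN_MARKERS = (
    "i don't know", "cannot be determined", "unanswerable",
    "not possible to answer", "cannot answer",
)

def is_abstention(answer: str) -> bool:
    """Return True if the model's answer looks like a refusal to answer."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)

def abstention_rate(
    items: Iterable[Dict[str, str]],          # each item: {"image_path", "question", "category"}
    query_vlm: Callable[[str, str], str],     # (image_path, question) -> model answer, user-supplied
) -> Dict[str, float]:
    """Fraction of unanswerable questions the model declines to answer, per category."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        answer = query_vlm(item["image_path"], item["question"])
        cat = item["category"]                # e.g. "hybrid_entity", "impossible_scene", "fictional_figure"
        total[cat] = total.get(cat, 0) + 1
        correct[cat] = correct.get(cat, 0) + int(is_abstention(answer))
    return {cat: correct[cat] / total[cat] for cat in total}
```

In practice, a keyword heuristic like this is only a rough proxy; a judge model or human annotation would give a more reliable abstention label.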
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA [19.6585442152102]
We study the Knowledge-Based visual question-answering problem, for which given a question, the models need to ground it into the visual modality to find the answer.
Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image.
arXiv Detail & Related papers (2024-06-27T02:19:38Z)
- Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
The self-alignment method is capable of not only refusing to answer but also explaining why unknown questions are unanswerable.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself for aligning the responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z)
- CommVQA: Situating Visual Question Answering in Communicative Contexts [16.180130883242672]
We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios where the image might appear.
We show that access to contextual information is essential for solving CommVQA, yielding the highest-performing VQA model.
arXiv Detail & Related papers (2024-02-22T22:31:39Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Localized Questions in Medical Visual Question Answering [2.005299372367689]
Visual Question Answering (VQA) models aim to answer natural language questions about given images.
Existing medical VQA models typically focus on answering questions that refer to an entire image.
This paper proposes a novel approach for medical VQA that addresses this limitation by developing a model that can answer questions about image regions.
arXiv Detail & Related papers (2023-07-03T14:47:18Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base using controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, marginally improves performance on the reasoning questions in VQA, and produces better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)