BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
- URL: http://arxiv.org/abs/2509.06010v1
- Date: Sun, 07 Sep 2025 10:58:17 GMT
- Title: BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
- Authors: Wanyin Cheng, Zanxi Ruan,
- Abstract summary: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users. We present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity.
- Score: 0.42970700836450487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions, posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.
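The final step described in the abstract, deciding whether several candidate answers point at the same image region or at different ones, can be illustrated with a much simpler stand-in. The sketch below is not the paper's chain-of-thought module: it assumes each candidate answer has already been grounded to an axis-aligned bounding box (a hypothetical simplification of a grounding model's output) and uses a plain pairwise intersection-over-union (IoU) check. The names `box_iou` and `same_region`, the box format, and the 0.5 threshold are all illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def same_region(groundings: Dict[str, Box], threshold: float = 0.5) -> bool:
    """Return True if every pair of grounded answers overlaps by at least
    `threshold` IoU, i.e. the answers plausibly describe one region.
    The threshold is an assumed value, not one taken from the paper."""
    return all(
        box_iou(a, b) >= threshold
        for a, b in combinations(groundings.values(), 2)
    )


# Two phrasings of one object versus two unrelated objects.
single = {"soup": (10, 10, 110, 110), "can of soup": (12, 8, 108, 112)}
multiple = {"soup": (10, 10, 110, 110), "table": (200, 200, 300, 300)}
print(same_region(single), same_region(multiple))
```

A geometric check like this only captures spatial overlap; the abstract's reasoning module is described as assessing the answers themselves, which a fixed threshold cannot do.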
Related papers
- UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z)
- Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions [17.905632446959007]
In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. We introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context.
arXiv Detail & Related papers (2025-07-18T09:31:43Z)
- COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes [14.603382370403]
We formulate visual lateral thinking as a multiple-choice question-answering task. We describe a three-step taxonomy-driven methodology for instantiating task examples. We develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles.
arXiv Detail & Related papers (2024-09-06T06:49:55Z)
- Long-Form Answers to Visual Questions from Blind and Low Vision People [54.00665222249701]
VizWiz-LF is a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer.
arXiv Detail & Related papers (2024-08-12T17:15:02Z)
- ConVQG: Contrastive Visual Question Generation with Multimodal Guidance [20.009626292937995]
We propose Contrastive Visual Question Generation (ConVQG) to generate image-grounded, text-guided, and knowledge-rich questions.
Experiments on knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-02-20T09:20:30Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Point and Ask: Incorporating Pointing into Visual Question Answering [14.744503080484977]
We introduce and motivate point-input questions as an extension of Visual Question Answering (VQA).
Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region.
We uncover and address several visual recognition challenges, including the ability to infer human intent.
arXiv Detail & Related papers (2020-11-27T11:43:45Z)
- Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.