Right this way: Can VLMs Guide Us to See More to Answer Questions?
- URL: http://arxiv.org/abs/2411.00394v1
- Date: Fri, 01 Nov 2024 06:43:54 GMT
- Title: Right this way: Can VLMs Guide Us to See More to Answer Questions?
- Authors: Li Liu, Diji Yang, Sijia Zhong, Kalyana Suma Sree Tholeti, Lei Ding, Yi Zhang, Leilani H. Gilpin
- Abstract summary: In question-answering scenarios, humans assess whether the available information is sufficient and seek additional information if necessary.
In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information.
This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
- Score: 11.693356269848517
- License:
- Abstract: In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating "where to know" scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
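The abstract does not say how the synthetic "where to know" data is built. The following is only a minimal sketch of one plausible approach, assuming that an answer-relevant region is known as a bounding box and that a training example is made by cropping the image so part of that region falls out of frame. The names `Box`, `crop_for_direction`, `visible_fraction`, and `make_example` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel box: (left, top) inclusive, (right, bottom) exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)


def crop_for_direction(width: int, height: int, region: Box, direction: str) -> Box:
    """Return a crop window whose edge cuts through the middle of `region`.

    `direction` is the guidance label: the way the camera would have to move
    to bring the hidden half of the region back into frame (an assumption).
    """
    mid_x = (region.left + region.right) // 2
    mid_y = (region.top + region.bottom) // 2
    crops = {
        "left": Box(mid_x, 0, width, height),   # left half of region is cut off
        "right": Box(0, 0, mid_x, height),      # right half is cut off
        "up": Box(0, mid_y, width, height),     # top half is cut off
        "down": Box(0, 0, width, mid_y),        # bottom half is cut off
    }
    if direction not in crops:
        raise ValueError(f"unknown direction: {direction!r}")
    return crops[direction]


def visible_fraction(region: Box, crop: Box) -> float:
    """Fraction of the region's area that is still inside the crop."""
    if region.area() == 0:
        raise ValueError("region must have positive area")
    overlap = Box(
        max(region.left, crop.left),
        max(region.top, crop.top),
        min(region.right, crop.right),
        min(region.bottom, crop.bottom),
    )
    return overlap.area() / region.area()


def make_example(width: int, height: int, region: Box, question: str, direction: str) -> dict:
    """Build one synthetic record: the question, the crop, and the guidance label."""
    crop = crop_for_direction(width, height, region, direction)
    return {
        "question": question,
        "crop": crop,
        "guidance": direction,
        "visible_fraction": visible_fraction(region, crop),
    }


example = make_example(100, 80, Box(20, 30, 40, 50), "What does the label say?", "left")
```

In this sketch the guidance label is known by construction, so no human annotation is needed for the synthetic examples; whether the paper's framework works this way is not stated in the abstract.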
Related papers
- Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering [20.16172308719101]
Zero-shot visual question answering (ZS-VQA) intends to answer visual questions without providing training samples.
Existing research in ZS-VQA has proposed to leverage knowledge graphs or large language models (LLMs) as external information sources.
We propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering.
arXiv Detail & Related papers (2025-01-22T08:14:11Z)
- FiVL: A Framework for Improved Vision-Language Alignment [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding.
These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence.
To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability.
arXiv Detail & Related papers (2024-12-19T09:24:10Z)
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [95.63899307791665]
In this paper, we present Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension.
arXiv Detail & Related papers (2024-12-04T20:35:07Z)
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.
Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems.
We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- SimVQA: Exploring Simulated Environments for Visual Question Answering [15.030013924109118]
We explore using synthetic computer-generated data to fully control the visual and language space.
We quantify the effect of synthetic data on real-world VQA benchmarks and the extent to which it produces results that generalize to real data.
We propose Feature Swapping (F-SWAP) -- where we randomly switch object-level features during training to make a VQA model more domain invariant.
arXiv Detail & Related papers (2022-03-31T17:44:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.