Can Pre-trained Vision and Language Models Answer Visual
Information-Seeking Questions?
- URL: http://arxiv.org/abs/2302.11713v5
- Date: Tue, 17 Oct 2023 14:19:13 GMT
- Title: Can Pre-trained Vision and Language Models Answer Visual
Information-Seeking Questions?
- Authors: Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan
Ritter, Ming-Wei Chang
- Abstract summary: We introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions.
We analyze various pre-trained visual question answering models and gain insights into their characteristics.
We show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents.
- Score: 50.29862466940209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained vision and language models have demonstrated state-of-the-art
capabilities over existing tasks involving images and texts, including visual
question answering. However, it remains unclear whether these models can
answer questions that not only query visual content but are also
knowledge-intensive and information-seeking. In this study, we introduce
InfoSeek, a visual question answering dataset tailored for information-seeking
questions that cannot be answered with only common sense knowledge. Using
InfoSeek, we analyze various pre-trained visual question answering models and
gain insights into their characteristics. Our findings reveal that
state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.)
face challenges in answering visual information-seeking questions, but
fine-tuning on the InfoSeek dataset elicits models to draw on the fine-grained
knowledge they acquired during pre-training. Furthermore, we show that accurate
visual entity recognition can be used to improve performance on InfoSeek by
retrieving relevant documents, indicating significant room for improvement.
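The pipeline-with-retrieval idea described in the abstract (recognize the visual entity, retrieve relevant documents for it, then answer the question from the retrieved text) can be sketched as below. This is a minimal, hypothetical illustration: the recognizer, retriever, and reader are caller-supplied stand-ins, not the models or APIs used in the paper.

```python
# Illustrative sketch (not the authors' code) of a retrieval-augmented VQA
# pipeline: ground the image to an entity, retrieve documents about that
# entity, and answer the question from the retrieved text. All components
# below are hypothetical stand-ins supplied by the caller.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalAugmentedVQA:
    # image -> entity name (e.g., "Golden Gate Bridge"); an assumption, not the paper's API
    recognize_entity: Callable[[bytes], str]
    # entity name, k -> top-k documents (e.g., Wikipedia passages)
    retrieve_documents: Callable[[str, int], List[str]]
    # (question, context) -> answer string, e.g., a fine-tuned reader model
    read_answer: Callable[[str, str], str]
    top_k: int = 5

    def answer(self, image: bytes, question: str) -> str:
        entity = self.recognize_entity(image)
        docs = self.retrieve_documents(entity, self.top_k)
        context = "\n".join(docs)
        # Condition the answer on retrieved knowledge rather than the image alone.
        return self.read_answer(question, context)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    pipeline = RetrievalAugmentedVQA(
        recognize_entity=lambda img: "Golden Gate Bridge",
        retrieve_documents=lambda entity, k: [f"{entity} was opened in 1937."],
        read_answer=lambda q, ctx: ctx.split(" in ")[-1].rstrip("."),
    )
    print(pipeline.answer(b"<image bytes>", "When was this bridge opened?"))  # -> 1937
```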
Related papers
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge [39.02148880719576]
We introduce EchoSight, a novel framework for knowledge-based Visual Question Answering.
To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information.
Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA.
arXiv Detail & Related papers (2024-07-17T16:55:42Z)
- Extracting Training Data from Document-Based VQA Models [67.1470112451617]
Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image).
We show these models can memorise responses for training samples and regurgitate them even when the relevant visual information has been removed.
This includes Personally Identifiable Information repeated only once in the training set, indicating these models could divulge sensitive information and therefore pose a privacy risk.
arXiv Detail & Related papers (2024-07-11T17:44:41Z) - Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA [19.6585442152102]
We study the knowledge-based visual question answering problem, in which a model must ground the given question in the visual modality to find the answer.
Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2024-06-27T02:19:38Z)
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant [48.220285886328746]
We introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant.
SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing visual clues and prior language knowledge.
Fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods.
arXiv Detail & Related papers (2024-03-17T18:42:38Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- A Dataset and Baselines for Visual Question Answering on Art [33.14114180168856]
We introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering).
The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods.
Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions.
arXiv Detail & Related papers (2020-08-28T07:33:30Z)
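As a rough illustration of the question-decomposition idea summarized in the "Disentangling Knowledge-based and Visual Reasoning" entry above, the sketch below splits a complex question into simpler sub-questions, answers each against the image, and composes the sub-answers into a final answer. The decomposer, VQA model, and composer are hypothetical callables, not that paper's implementation.

```python
# Hypothetical sketch of question decomposition for KB-VQA: the three callables
# are placeholders (e.g., an LLM decomposer, a plain VQA model, a reasoning
# step that combines sub-answers). Not the paper's actual components.
from typing import Callable, Dict, List


def decomposed_vqa(
    image: bytes,
    question: str,
    decompose: Callable[[str], List[str]],
    answer_visual: Callable[[bytes, str], str],
    compose: Callable[[str, Dict[str, str]], str],
) -> str:
    sub_questions = decompose(question)                                # complex -> simpler questions
    sub_answers = {q: answer_visual(image, q) for q in sub_questions}  # answer each against the image
    return compose(question, sub_answers)                              # reason over the sub-answers


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    print(decomposed_vqa(
        b"<image bytes>",
        "Is the animal in the picture native to this country?",
        decompose=lambda q: ["What animal is in the picture?", "What country is shown?"],
        answer_visual=lambda img, q: {"What animal is in the picture?": "kangaroo",
                                      "What country is shown?": "Australia"}[q],
        compose=lambda q, subs: "yes" if set(subs.values()) == {"kangaroo", "Australia"} else "no",
    ))
```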