Filling the Image Information Gap for VQA: Prompting Large Language
Models to Proactively Ask Questions
- URL: http://arxiv.org/abs/2311.11598v1
- Date: Mon, 20 Nov 2023 08:23:39 GMT
- Title: Filling the Image Information Gap for VQA: Prompting Large Language
Models to Proactively Ask Questions
- Authors: Ziyue Wang, Chi Chen, Peng Li, Yang Liu
- Abstract summary: Large Language Models (LLMs) demonstrate impressive reasoning ability and retention of world knowledge.
Because images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
- Score: 15.262736501208467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate impressive reasoning ability and
retention of world knowledge not only in natural language tasks, but also in
some vision-language tasks such as open-domain knowledge-based visual question
answering (OK-VQA). As images are invisible to LLMs, researchers convert images
to text to engage LLMs in the visual question reasoning procedure. This leads
to discrepancies between images and their textual representations presented to
LLMs, which consequently impede final reasoning performance. To fill the
information gap and better leverage the reasoning capability, we design a
framework that enables LLMs to proactively ask relevant questions to unveil
more details in the image, along with filters for refining the generated
information. We validate our idea on OK-VQA and A-OKVQA. Our method
consistently boosts the performance of baseline methods by an average gain of
2.15% on OK-VQA, and achieves improvements across different LLMs.
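
The pipeline described in the abstract can be read as: caption the image, let the LLM proactively ask probing questions, answer them with a vision model, filter out unhelpful question-answer pairs, and reason over the enriched context. Below is a minimal sketch of that data flow only; the helper callables (caption_model, vqa_model, llm) and the simple relevance check are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the "proactively ask questions" pipeline from the abstract.
# All model calls (caption_model, vqa_model, llm) are hypothetical stand-ins
# supplied by the caller; this is not the authors' released code.
from typing import Callable, List, Tuple


def proactive_vqa(
    image,
    question: str,
    caption_model: Callable[[object], str],   # image -> caption text
    vqa_model: Callable[[object, str], str],  # (image, question) -> short answer
    llm: Callable[[str], str],                # prompt -> completion
    n_probe_questions: int = 3,
) -> str:
    """Let the LLM ask follow-up questions about the image, answer them with a
    vision model, filter unhelpful Q/A pairs, then reason over the result."""
    caption = caption_model(image)

    # 1. The LLM proactively asks questions to fill the image information gap.
    ask_prompt = (
        f"Image caption: {caption}\n"
        f"Main question: {question}\n"
        f"List {n_probe_questions} short questions about visual details that "
        f"would help answer the main question, one per line."
    )
    probe_questions: List[str] = [
        q.strip() for q in llm(ask_prompt).splitlines() if q.strip()
    ][:n_probe_questions]

    # 2. Answer the probing questions with a vision model.
    qa_pairs: List[Tuple[str, str]] = [
        (q, vqa_model(image, q)) for q in probe_questions
    ]

    # 3. Filter: keep only the Q/A pairs the LLM judges relevant
    #    (a simple stand-in for the paper's filtering step).
    kept = [
        (q, a)
        for q, a in qa_pairs
        if llm(
            f"Main question: {question}\n"
            f"Does knowing '{q} -> {a}' help answer it? Reply yes or no."
        ).strip().lower().startswith("yes")
    ]

    # 4. Final reasoning over the caption plus the refined Q/A context.
    context = "\n".join(f"Q: {q} A: {a}" for q, a in kept)
    final_prompt = (
        f"Image caption: {caption}\n{context}\n"
        f"Question: {question}\nAnswer with a short phrase."
    )
    return llm(final_prompt)
```

Only the data flow is meant to match the abstract; the actual prompts and filters are those described in the paper.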
Related papers
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
- Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets [9.67464173044675]
Visual Question Answering (VQA) is the task of answering a question about an image.
We present an approach for declarative knowledge distillation from Large Language Models (LLMs).
Our results confirm that distilling knowledge from LLMs is in fact a promising direction besides data-driven rule learning approaches.
arXiv Detail & Related papers (2024-10-12T08:17:03Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- What Large Language Models Bring to Text-rich VQA? [38.569505870771025]
Text-rich VQA, namely Visual Question Answering based on text recognition in images, is a cross-modal task that requires both image comprehension and text recognition.
We leverage external OCR models to recognize text in the image and Large Language Models (LLMs) to answer the question given the recognized text; a minimal sketch of this OCR-plus-LLM pipeline is given after this list.
This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLM) on four text-rich VQA datasets.
arXiv Detail & Related papers (2023-11-13T12:52:29Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields absolute zero-shot accuracy gains of 3.85% on VQAv2, 6.41% on A-OKVQA, and 7.94% on VizWiz.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose some knowledge-enhancement techniques towards improving the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z)
- Tackling VQA with Pretrained Foundation Models without Further Training [0.0]
Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks.
Given these capabilities, researchers have looked into how to adopt LLMs for Visual Question Answering (VQA).
In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem.
arXiv Detail & Related papers (2023-09-27T08:35:24Z)
- LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual I/O.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained with this approach, shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z)
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts to bridge the aforementioned modality and task disconnections.
arXiv Detail & Related papers (2022-12-21T08:39:36Z)
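
As a companion to the text-rich VQA entry above, here is a minimal sketch of an OCR-then-LLM pipeline: an external OCR model extracts the text in the image, and an LLM answers the question conditioned on that text. The ocr and llm callables are hypothetical stand-ins, not any specific library's API or the paper's implementation.

```python
# Minimal sketch of an OCR-then-LLM pipeline for text-rich VQA, as described
# in "What Large Language Models Bring to Text-rich VQA?". The `ocr` and `llm`
# callables are hypothetical stand-ins, not a specific library's API.
from typing import Callable, List


def text_rich_vqa(
    image,
    question: str,
    ocr: Callable[[object], List[str]],  # image -> recognized text snippets
    llm: Callable[[str], str],           # prompt -> completion
) -> str:
    """Recognize text in the image with an external OCR model, then let an LLM
    answer the question conditioned on the recognized text."""
    ocr_snippets = ocr(image)
    prompt = (
        "Text recognized in the image (OCR):\n"
        + "\n".join(f"- {t}" for t in ocr_snippets)
        + f"\n\nQuestion: {question}\nAnswer concisely using the OCR text above."
    )
    return llm(prompt)
```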
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.