Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
- URL: http://arxiv.org/abs/2404.08589v1
- Date: Fri, 12 Apr 2024 16:35:23 GMT
- Title: Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
- Authors: Övgü Özdemir, Erdem Akagündüz
- Abstract summary: Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
- Score: 3.6064695344878093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content. Over the past few years, numerous neural architectures have been proposed for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of using image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{https://github.com/ovguyo/captions-in-VQA}.
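The method paragraph above maps onto three concrete steps. The Python sketch below shows one way they could fit together; the stopword-based keyword extractor, the prefix-style caption conditioning, and all helper names are assumptions for illustration, not the authors' exact implementation, and the captioning-model call is stubbed out.

```python
# A minimal sketch of the three-step pipeline described in the abstract:
# (1) extract keywords from the question, (2) generate a question-driven
# caption conditioned on those keywords, (3) fold the caption into an LLM
# prompt. All names and the stopword list are illustrative assumptions.

STOPWORDS = {"what", "which", "who", "is", "are", "the", "a", "an", "of",
             "in", "on", "does", "do", "there", "this", "that", "to", "how"}

def extract_keywords(question: str) -> list[str]:
    """Step 1: keep the question's content words as caption guidance."""
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOPWORDS]

def generate_question_driven_caption(image_path: str, keywords: list[str]) -> str:
    """Step 2: condition a captioning model on the keywords.
    Sketched here as a text prefix; a real system would pass this prefix
    to a captioner (e.g. a BLIP-style model) together with the image."""
    prefix = "a picture of " + " ".join(keywords)
    # caption = captioner(image_path, text=prefix)  # model call elided
    return prefix  # placeholder so the sketch runs end to end

def build_llm_prompt(caption: str, question: str) -> str:
    """Step 3: incorporate the question-driven caption into the LLM prompt."""
    return (f"Context: {caption}\n"
            f"Question: {question}\n"
            f"Answer with a single word or short phrase.")

if __name__ == "__main__":
    q = "What color is the bus on the left?"
    caption = generate_question_driven_caption("example.jpg", extract_keywords(q))
    print(build_llm_prompt(caption, q))
```

In the paper's pipeline the caption would come from a state-of-the-art captioning model and the assembled prompt would be sent to an LLM; the sketch only fixes the data flow between the three steps.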
Related papers
- Are VLMs Really Blind [3.052971829873887]
Vision Language Models excel at handling a wide range of complex tasks, yet they fail to perform well on low-level, basic visual tasks.
Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions.
arXiv Detail & Related papers (2024-10-29T13:20:50Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [51.639880603821446]
We propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions for knowledge-based VQA.
We first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner; a minimal sketch of this prompting pattern follows the list below.
By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset.
arXiv Detail & Related papers (2021-09-10T17:51:06Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC).
In ISIC, a captioning system is given a target image and an issue: a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)
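Several entries above, PICa and PromptCap in particular, share one prompting pattern: the image is first converted into a caption, and a text-only LLM then answers from a few-shot prompt built out of caption/question/answer triples. The sketch below shows only that prompt-assembly step; the in-context examples and the template wording are illustrative assumptions, and the call to the LLM itself is omitted.

```python
# A minimal sketch of the PICa-style few-shot prompting pattern referenced
# above. The example triples and the template are illustrative assumptions;
# the call to the text-only LLM (e.g. GPT-3) is omitted.

FEW_SHOT = [  # (caption, question, answer) triples, made up for illustration
    ("a red double-decker bus on a city street",
     "What kind of vehicle is shown?", "bus"),
    ("a bowl of sliced oranges on a wooden table",
     "What fruit is in the bowl?", "orange"),
]

def build_few_shot_prompt(caption: str, question: str) -> str:
    """Concatenate in-context examples, then the test instance."""
    parts = ["Answer each question using the image caption."]
    for cap, q, a in FEW_SHOT:
        parts.append(f"Caption: {cap}\nQuestion: {q}\nAnswer: {a}")
    parts.append(f"Caption: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    prompt = build_few_shot_prompt("a man holding an umbrella in the rain",
                                   "Why is the man holding an umbrella?")
    print(prompt)  # this string would be sent to the LLM
```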