GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
- URL: http://arxiv.org/abs/2505.19354v1
- Date: Sun, 25 May 2025 23:00:30 GMT
- Title: GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
- Authors: Mohammad Mahdi Moradi, Sudhir Mudur
- Abstract summary: Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA). Innovations include grounded, question-aware caption generation that moves beyond generic descriptions to produce compact yet detailed, context-rich information.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.
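The abstract names four stages but not their interfaces, so the following is only a minimal sketch of how such a training-free pipeline could be wired. The helper names (`generate_grounded_caption`, `retrieve_knowledge`, `call_llm`), the stub outputs, and the prompt format are hypothetical stand-ins, not the authors' actual API.

```python
# Hypothetical sketch of a four-stage, training-free KB-VQA pipeline in the
# spirit of GC-KBVQA. All three helpers are stand-ins, not the paper's API.

def generate_grounded_caption(image_path: str, question: str) -> str:
    """Stages 1-2: produce a question-aware caption grounded in the image.
    A real system would call a captioner conditioned on the question."""
    return "a red double-decker bus parked beside a London landmark"  # stub

def retrieve_knowledge(caption: str, question: str, k: int = 3) -> list[str]:
    """Stage 3: fetch external facts relevant to the caption and question.
    A real system would query a knowledge base or dense retriever."""
    return ["Double-decker buses are iconic public transport in London."]  # stub

def call_llm(prompt: str) -> str:
    """Stage 4: a frozen, general-purpose LLM answers from the prompt alone."""
    return "london"  # stub

def answer(image_path: str, question: str) -> str:
    caption = generate_grounded_caption(image_path, question)
    facts = retrieve_knowledge(caption, question)
    # The prompt packs compact, context-rich evidence; no multimodal training.
    prompt = (
        f"Context: {caption}\n"
        + "".join(f"Fact: {f}\n" for f in facts)
        + f"Question: {question}\nShort answer:"
    )
    return call_llm(prompt)

print(answer("bus.jpg", "In which city was this photo taken?"))
```

The point of such a design is that only the prompt carries the visual evidence, so any general-purpose, pre-trained LLM can answer without multimodal fine-tuning.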
Related papers
- SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered a well-suited method for increasing the quality of question-answering functionality. LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
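A minimal sketch of the loop the summary describes, assuming the LLM translates a question into SPARQL that is then executed against the public Wikidata endpoint; `llm_generate` is a stub standing in for any completion call, and the paper's actual prompting setup is not shown here.

```python
# Sketch: have an LLM translate a question to SPARQL, then run the query.
# `llm_generate` is a hypothetical stand-in for an LLM completion call.
import json
import urllib.parse
import urllib.request

def llm_generate(prompt: str) -> str:
    # A real system would call an LLM here; this stub returns a fixed query.
    return "SELECT ?capital WHERE { wd:Q183 wdt:P36 ?capital . }"  # Germany

def run_sparql(query: str, endpoint: str = "https://query.wikidata.org/sparql"):
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "kbqa-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

question = "What is the capital of Germany?"
query = llm_generate(f"Write a Wikidata SPARQL query answering: {question}")
print(run_sparql(query))  # one binding pointing at Berlin (wd:Q64)
```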
arXiv Detail & Related papers (2025-07-18T12:28:08Z)
- A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task [15.932332484902103]
Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA). No comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods.
arXiv Detail & Related papers (2025-04-24T13:37:25Z)
- Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models [51.47994645529258]
We propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance. Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.
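As a rough illustration of question-conditioned aggregation, the numpy sketch below re-weights neighbor messages by their similarity to a question embedding; the dimensions, the softmax attention, and the 50/50 residual mix are illustrative assumptions, not QAP's actual architecture.

```python
# Sketch: question-conditioned neighbor aggregation for one GNN layer.
# Edge messages are re-weighted by similarity to the question embedding,
# so question-relevant KG facts dominate the aggregated representation.
import numpy as np

def question_aware_aggregate(node_feats, neighbors, q_emb):
    """node_feats: (N, d) array; neighbors: list of neighbor-id lists;
    q_emb: (d,) question embedding. Returns updated (N, d) features."""
    out = np.zeros_like(node_feats)
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            out[i] = node_feats[i]
            continue
        msgs = node_feats[nbrs]          # (k, d) neighbor messages
        scores = msgs @ q_emb            # relevance to the question
        w = np.exp(scores - scores.max())
        w /= w.sum()                     # softmax attention weights
        out[i] = 0.5 * node_feats[i] + 0.5 * (w @ msgs)
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
updated = question_aware_aggregate(feats, [[1, 2], [0], [3], []],
                                   rng.normal(size=8))
print(updated.shape)  # (4, 8)
```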
arXiv Detail & Related papers (2025-03-30T17:09:11Z)
- Fine-Grained Retrieval-Augmented Generation for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases.
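A toy sketch of such knowledge units, assuming each pairs a text snippet with an entity-image embedding and is retrieved by cosine similarity; the in-memory numpy index stands in for a real vector database, and all field names are invented for illustration.

```python
# Sketch: "knowledge units" pairing a text snippet with an entity-image
# embedding, retrieved by cosine similarity. In-memory numpy stands in for
# a real vector database; field names are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class KnowledgeUnit:
    snippet: str           # textual fact about the entity
    image_emb: np.ndarray  # embedding of the entity's reference image

def retrieve(units, query_emb, top_k=2):
    mat = np.stack([u.image_emb for u in units])
    sims = mat @ query_emb / (
        np.linalg.norm(mat, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    best = np.argsort(-sims)[:top_k]
    return [units[i].snippet for i in best]

rng = np.random.default_rng(1)
db = [KnowledgeUnit(f"fact about entity {i}", rng.normal(size=16))
      for i in range(5)]
region_emb = rng.normal(size=16)  # embedding of an image region in question
print(retrieve(db, region_emb))  # the two most similar units' snippets
```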
arXiv Detail & Related papers (2025-02-28T11:25:38Z)
- Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering [44.54319663913782]
We propose Retrieval-Augmented MLLMs with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved knowledge for a given image-question pair. It achieves state-of-the-art (SOTA) performance of 63.92% on OK-VQA.
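To make the idea concrete, the sketch below compresses many retrieved-context embeddings into a small fixed budget; note that RACC learns this compression, whereas the chunked mean-pooling here is only an illustrative stand-in.

```python
# Sketch: shrink many retrieved-context embeddings into a fixed small budget
# before feeding an MLLM. The learned compressor is replaced here by simple
# mean-pooling over contiguous chunks, purely for illustration.
import numpy as np

def compress_contexts(ctx_embs: np.ndarray, budget: int) -> np.ndarray:
    """ctx_embs: (n, d) retrieved-knowledge embeddings -> (budget, d)."""
    chunks = np.array_split(np.arange(len(ctx_embs)), budget)
    return np.stack([ctx_embs[idx].mean(axis=0) for idx in chunks])

rng = np.random.default_rng(2)
retrieved = rng.normal(size=(32, 64))  # 32 passages for one image-question
compressed = compress_contexts(retrieved, budget=4)
print(compressed.shape)                # (4, 64): far fewer tokens to prompt
```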
arXiv Detail & Related papers (2024-09-11T15:11:39Z)
- A Simple Baseline for Knowledge-Based Visual Question Answering [78.00758742784532]
This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA).
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
arXiv Detail & Related papers (2023-10-20T15:08:17Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
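A toy sketch of this retrieve-then-rank idea on a hand-written miniature KG: paths from a question entity are enumerated, pruned by hop count, and ranked; the token-overlap scorer stands in for GATHER's learned path-level ranker.

```python
# Toy KG: head entity -> list of (relation, tail) edges.
KG = {
    "rainbow": [("caused_by", "refraction"), ("has_color", "red")],
    "refraction": [("occurs_in", "water droplets")],
    "red": [("is_a", "color")],
}

def enumerate_paths(start, max_hops=2):
    """Depth-first enumeration of relation paths of up to max_hops edges."""
    out, frontier = [], [[(None, start)]]
    while frontier:
        path = frontier.pop()
        if len(path) > 1:              # keep paths with at least one edge
            out.append(path)
        if len(path) <= max_hops:      # prune by hop count
            for rel, tail in KG.get(path[-1][1], []):
                frontier.append(path + [(rel, tail)])
    return out

def score(path, question):
    # Toy relevance: overlap between path entities and question tokens.
    toks = set(question.lower().replace("?", "").split())
    return sum(t in toks for _, node in path for t in node.lower().split())

q = "What causes a rainbow in water droplets?"
ranked = sorted(enumerate_paths("rainbow"), key=lambda p: score(p, q),
                reverse=True)
print(ranked[0])  # the top path doubles as an explanation of the answer
```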
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- Make a Choice! Knowledge Base Question Answering with In-Context Learning [1.7827767384590838]
Question answering over knowledge bases (KBQA) aims to answer factoid questions with a given knowledge base (KB).
Due to the large scale of KBs, it is impossible for annotated data to cover every fact schema in a KB.
We present McL-KBQA, a framework that incorporates the few-shot ability of LLMs into the KBQA method via ICL-based multiple choice.
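A minimal sketch of what an ICL-based multiple-choice prompt could look like; the exemplar questions and options are invented, and McL-KBQA's real prompt format and candidate generation are not shown.

```python
# Sketch: turning KBQA into multiple choice for in-context learning. A few
# worked exemplars plus lettered candidate answers are packed into one prompt
# for a frozen LLM; all exemplar text here is illustrative.
def build_mc_prompt(question, candidates, exemplars):
    lines = []
    for ex_q, ex_opts, ex_ans in exemplars:       # few-shot demonstrations
        lines.append(f"Question: {ex_q}")
        lines += [f"{chr(65 + i)}. {o}" for i, o in enumerate(ex_opts)]
        lines.append(f"Answer: {ex_ans}\n")
    lines.append(f"Question: {question}")         # the test instance
    lines += [f"{chr(65 + i)}. {o}" for i, o in enumerate(candidates)]
    lines.append("Answer:")
    return "\n".join(lines)

exemplars = [("Who wrote Hamlet?",
              ["Charles Dickens", "William Shakespeare"], "B")]
prompt = build_mc_prompt("Which river flows through Paris?",
                         ["Seine", "Thames", "Danube"], exemplars)
print(prompt)  # feed this to any LLM; the letter it emits picks the answer
```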
arXiv Detail & Related papers (2023-05-23T11:56:03Z)
- Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [28.763437313766996]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. We present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA.
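A small sketch of prompting with answer heuristics in this spirit: a VQA model's top candidates and confidences (stubbed with made-up values) are listed in the prompt so the LLM can confirm or override them with its own knowledge.

```python
# Sketch: answer-heuristic prompting. A vanilla VQA model's top candidates
# (stubbed below, with invented confidences) are included as hints in the
# prompt; the LLM remains free to produce a different answer.
def vqa_top_candidates(image_path, question, k=3):
    # Stand-in for a trained VQA model's top-k answers and confidences.
    return [("umbrella", 0.46), ("parasol", 0.31), ("tent", 0.08)][:k]

def build_heuristic_prompt(context, question, candidates):
    cand_str = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (f"Context: {context}\n"
            f"Question: {question}\n"
            f"Candidates: {cand_str}\n"
            f"Answer:")

prompt = build_heuristic_prompt(
    "a woman holding something over her head on a sunny beach",
    "What is she holding?",
    vqa_top_candidates("beach.jpg", "What is she holding?"))
print(prompt)  # the LLM sees the candidates as hints, not hard constraints
```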
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
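The super-node idea can be illustrated without any GNN machinery: below, plain adjacency lists for a tiny scene graph and concept graph are bridged by one shared node representing the QA context; all node names are invented.

```python
# Sketch: joining a scene graph and a concept graph through one "super node"
# that represents the QA context, so messages can flow between the two
# structures. Plain adjacency lists stand in for the actual GNN machinery.
scene_graph = {"person": ["frisbee"], "frisbee": ["person"]}
concept_graph = {"frisbee_concept": ["outdoor_game"],
                 "outdoor_game": ["frisbee_concept"]}

def fuse_with_super_node(scene, concept, super_node="qa_context"):
    graph = {**scene, **concept, super_node: []}
    for node in list(scene) + list(concept):
        graph[super_node].append(node)            # super node sees every node
        graph[node] = graph[node] + [super_node]  # and every node sees it
    return graph

unified = fuse_with_super_node(scene_graph, concept_graph)
print(unified["qa_context"])  # bridges unstructured and structured knowledge
# A GNN over `unified` can now pass messages across both graphs in one pass.
```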
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection [14.678153928301493]
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question about an associated image.
Recent text-only work has shown that knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
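A toy sketch of this kind of injection, assuming a KG entity embedding is projected to the LM's hidden size and added onto the mention's token embeddings; all embeddings and the projection are random stand-ins for pretrained ones.

```python
# Sketch: "knowledge injection" by adding a KG entity embedding (projected to
# the LM's hidden size) onto the token embeddings of that entity's mention.
import numpy as np

rng = np.random.default_rng(3)
d_lm, d_kg = 8, 4
tokens = ["the", "eiffel", "tower", "is", "tall"]
token_embs = rng.normal(size=(len(tokens), d_lm))  # LM token embeddings
entity_emb = rng.normal(size=d_kg)                 # KG entity embedding
proj = rng.normal(size=(d_kg, d_lm))               # learned projection (stub)

mention_span = (1, 3)  # tokens 1..2 mention the entity "Eiffel Tower"
injected = token_embs.copy()
injected[mention_span[0]:mention_span[1]] += entity_emb @ proj

print(np.allclose(injected[0], token_embs[0]))  # True: other tokens untouched
print(injected.shape)                           # (5, 8)
```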
arXiv Detail & Related papers (2021-12-13T18:45:42Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
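A toy sketch of combining the two knowledge sources by late fusion of answer scores; the stubbed transformer scores, the miniature KB, and the additive fusion rule are all illustrative assumptions rather than KRISP's actual design.

```python
# Sketch: late fusion of implicit and symbolic answer predictions. A
# transformer head's answer scores (stubbed) are combined with scores from a
# symbolic KB lookup; the numbers and fusion rule are illustrative only.
def implicit_scores(question, image_path):
    # Stand-in for a transformer trained on VQA data.
    return {"dog": 0.4, "wolf": 0.35, "fox": 0.1}

def symbolic_scores(question, detected_concepts, kb):
    # Score answers reachable from detected concepts in the knowledge graph.
    scores = {}
    for concept in detected_concepts:
        for answer in kb.get(concept, []):
            scores[answer] = scores.get(answer, 0.0) + 0.5
    return scores

KB = {"husky": ["dog"], "snow": ["winter"]}
imp = implicit_scores("What animal is this?", "husky.jpg")
sym = symbolic_scores("What animal is this?", ["husky", "snow"], KB)
fused = {a: imp.get(a, 0.0) + sym.get(a, 0.0) for a in set(imp) | set(sym)}
print(max(fused, key=fused.get))  # "dog": symbolic evidence breaks the tie
```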
arXiv Detail & Related papers (2020-12-20T20:13:02Z)