A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
- URL: http://arxiv.org/abs/2504.17547v1
- Date: Thu, 24 Apr 2025 13:37:25 GMT
- Title: A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
- Authors: Jiaqi Deng, Zonghan Wu, Huan Huo, Guandong Xu,
- Abstract summary: Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA)<n>No comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods.
- Score: 15.932332484902103
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.
Related papers
- Fine-Grained Retrieval-Augmented Generation for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images.<n>Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach.<n>This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases.
arXiv Detail & Related papers (2025-02-28T11:25:38Z) - Assessing the Robustness of Retrieval-Augmented Generation Systems in K-12 Educational Question Answering with Knowledge Discrepancies [41.49674849980441]
We show that discrepancy between textbooks and parametric knowledge in Large Language Models could undermine the effectiveness of RAG systems.<n>We present EduKDQA, a question answering dataset that simulates knowledge discrepancies in real applications.<n>We find that most RAG systems suffer from a substantial performance drop in question answering with knowledge discrepancies.
arXiv Detail & Related papers (2024-12-12T06:38:40Z) - A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA [67.75989848202343]
This paper presents a unified end-to-end retriever-reader framework towards knowledge-based VQA.
We shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning.
Our scheme is able to not only provide guidance for knowledge retrieval, but also drop these instances potentially error-prone towards question answering.
arXiv Detail & Related papers (2022-06-30T02:35:04Z) - REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA)
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z) - A Survey on Complex Knowledge Base Question Answering: Methods,
Challenges and Solutions [41.680033017518376]
Knowledge base question answering (KBQA) aims to answer a question over a knowledge base (KB)
We elaborately summarize the typical challenges and solutions for complex KBQA.
arXiv Detail & Related papers (2021-05-25T03:45:30Z) - Contextualized Knowledge-aware Attentive Neural Network: Enhancing
Answer Selection with Knowledge [77.77684299758494]
We extensively investigate approaches to enhancing the answer selection model with external knowledge from knowledge graph (KG)
First, we present a context-knowledge interaction learning framework, Knowledge-aware Neural Network (KNN), which learns the QA sentence representations by considering a tight interaction with the external knowledge from KG and the textual information.
To handle the diversity and complexity of KG information, we propose a Contextualized Knowledge-aware Attentive Neural Network (CKANN), which improves the knowledge representation learning with structure information via a customized Graph Convolutional Network (GCN) and comprehensively learns context-based and knowledge-based sentence representation via
arXiv Detail & Related papers (2021-04-12T05:52:20Z) - Reasoning over Vision and Language: Exploring the Benefits of
Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z) - KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z) - A Survey on Complex Question Answering over Knowledge Base: Recent
Advances and Challenges [71.4531144086568]
Question Answering (QA) over Knowledge Base (KB) aims to automatically answer natural language questions.
Researchers have shifted their attention from simple questions to complex questions, which require more KB triples and constraint inference.
arXiv Detail & Related papers (2020-07-26T07:13:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.