Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering
- URL: http://arxiv.org/abs/2501.12697v1
- Date: Wed, 22 Jan 2025 08:14:11 GMT
- Title: Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering
- Authors: Qian Tao, Xiaoyang Fan, Yong Xu, Xingquan Zhu, Yufei Tang
- Abstract summary: Zero-shot visual question answering (ZS-VQA) aims to answer visual questions without any training samples.
Existing ZS-VQA research has proposed leveraging knowledge graphs or large language models (LLMs) as external information sources.
We propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering.
- Score: 20.16172308719101
- Abstract: Zero-shot visual question answering (ZS-VQA), an emerging and critical research area, aims to answer visual questions without any training samples. Existing ZS-VQA research has proposed leveraging knowledge graphs or large language models (LLMs), respectively, as external information sources to help the VQA model comprehend images and questions. However, LLMs often struggle to accurately interpret the specific meaning of a question. Meanwhile, although knowledge graphs encode rich entity relationships, it is challenging to effectively connect their entities to the content of an individual image for visual question answering. In this paper, we propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering. Our approach uses the LLM's powerful understanding capability to accurately interpret image content through a strategic question search mechanism. Meanwhile, the knowledge graph is used to expand and connect the user's query to the image content for better visual question answering. An optimization algorithm is further used to determine the optimal weights for the loss functions derived from the different information sources, yielding a globally optimal set of candidate answers. Experimental results on two benchmark datasets demonstrate that our model achieves state-of-the-art (SOTA) performance. Both the source code and the benchmark data will be released for public access.
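The abstract leaves the fusion details unspecified; as a rough illustration of the core idea (weighting candidate-answer evidence from the knowledge graph and the LLM, then searching for the best weight on validation data), here is a minimal sketch. The linear fusion, the grid search, and all names are assumptions, not the authors' algorithm:

```python
from typing import Dict, List

def fuse_scores(kg_scores: Dict[str, float],
                llm_scores: Dict[str, float],
                w_kg: float) -> Dict[str, float]:
    """Linearly combine per-candidate scores from the two sources."""
    candidates = set(kg_scores) | set(llm_scores)
    return {c: w_kg * kg_scores.get(c, 0.0)
               + (1.0 - w_kg) * llm_scores.get(c, 0.0)
            for c in candidates}

def search_weight(val_examples: List[dict], grid: int = 21) -> float:
    """Grid-search the fusion weight that maximizes validation accuracy.
    Each example: {"kg": {...}, "llm": {...}, "answer": str} (assumed shape)."""
    best_w, best_acc = 0.5, -1.0
    for step in range(grid):
        w = step / (grid - 1)
        hits = 0
        for ex in val_examples:
            fused = fuse_scores(ex["kg"], ex["llm"], w)
            pred = max(fused, key=fused.get)
            hits += int(pred == ex["answer"])
        acc = hits / len(val_examples)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```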
Related papers
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce Amar, an Adaptive Multi-Aspect Retrieval-augmented framework over knowledge graphs (KGs).
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
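As a hedged sketch of the summarized mechanism (turning retrieved KG text into prompt embeddings), one could pool token embeddings of each retrieved snippet into a soft-prompt vector. The module below is illustrative only, not the Amar implementation:

```python
import torch
import torch.nn as nn

class PromptEmbedder(nn.Module):
    """Pool retrieved KG snippets (entity, relation, and subgraph text)
    into soft-prompt vectors to prepend to the question embedding."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, snippet_ids: torch.Tensor) -> torch.Tensor:
        # snippet_ids: (num_snippets, seq_len) token ids of retrieved text
        tokens = self.tok_emb(snippet_ids)   # (S, L, D)
        pooled = tokens.mean(dim=1)          # one vector per snippet
        return self.proj(pooled)             # (S, D) soft prompts
```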
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- Right this way: Can VLMs Guide Us to See More to Answer Questions? [11.693356269848517]
In question-answering scenarios, humans assess whether the available information is sufficient and seek additional information if necessary.
In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information.
This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
arXiv Detail & Related papers (2024-11-01T06:43:54Z)
- Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA [19.6585442152102]
We study the knowledge-based visual question answering problem, in which, given a question, the model must ground it in the visual modality to find the answer.
Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image.
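A minimal sketch of this decomposition idea, with all callables (`decompose`, `answer_visual`, `aggregate`) as hypothetical stand-ins for the paper's components:

```python
from typing import Callable, List

def decompose_and_answer(question: str,
                         decompose: Callable[[str], List[str]],
                         answer_visual: Callable[[str], str],
                         aggregate: Callable[[str, List[str]], str]) -> str:
    """Answer a complex knowledge-based question by first splitting it
    into simpler sub-questions that are easier to ground in the image."""
    sub_questions = decompose(question)           # e.g. via an LLM prompt
    sub_answers = [answer_visual(q) for q in sub_questions]
    # Fuse the sub-answers back into a final answer to the original question.
    return aggregate(question, sub_answers)
```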
arXiv Detail & Related papers (2024-06-27T02:19:38Z)
- QAGCF: Graph Collaborative Filtering for Q&A Recommendation [58.21387109664593]
Question and answer (Q&A) platforms usually recommend question-answer pairs to meet users' knowledge acquisition needs.
Recommending pairs rather than single items makes user behavior more complex and presents two challenges for Q&A recommendation.
We introduce Question & Answer Graph Collaborative Filtering (QAGCF), a graph neural network model that creates separate graphs for collaborative and semantic views.
arXiv Detail & Related papers (2024-06-07T10:52:37Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
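As an illustration of such a question-driven prompting strategy (not QVix's actual prompts or code), the flow might look like this, where `llm` and `lvlm` are hypothetical callables:

```python
from typing import Callable

def qvix_style_answer(image: object,
                      question: str,
                      llm: Callable[[str], str],
                      lvlm: Callable[[object, str], str],
                      n_probe: int = 3) -> str:
    """Ask auxiliary exploratory questions about the image first,
    then answer the target question with those clues as context."""
    probe_prompt = (f"For the task question '{question}', list {n_probe} "
                    "short questions whose answers would help, one per line.")
    probes = [p for p in llm(probe_prompt).splitlines() if p][:n_probe]
    # Query the vision-language model for each probe question.
    clues = [f"Q: {p} A: {lvlm(image, p)}" for p in probes]
    final_prompt = "\n".join(clues) + f"\nNow answer: {question}"
    return lvlm(image, final_prompt)
```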
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and retain extensive world knowledge.
Because images are invisible to LLMs, researchers convert images to text to involve LLMs in the visual question reasoning process.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
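A hedged sketch of this ask-before-answering loop; the stopping token, prompt wording, and `vqa_model` helper are all illustrative assumptions:

```python
from typing import Callable

def ask_until_sufficient(caption: str,
                         question: str,
                         llm: Callable[[str], str],
                         vqa_model: Callable[[str], str],
                         max_rounds: int = 3) -> str:
    """Let the LLM request missing image details before committing
    to an answer, instead of reasoning over the caption alone."""
    context = caption
    for _ in range(max_rounds):
        probe = llm(f"Context: {context}\nTask: {question}\n"
                    "If the context is sufficient reply DONE, "
                    "otherwise ask one question about the image.")
        if probe.strip() == "DONE":
            break
        # A visual expert (here, a VQA model) answers the probe question.
        context += f"\n{probe} -> {vqa_model(probe)}"
    return llm(f"Context: {context}\nAnswer the question: {question}")
```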
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
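As a rough, gradient-free sketch of rephrase-then-select (not the authors' code; the confidence-scoring prompt is an assumption):

```python
from typing import Callable, List

def repare_style_rephrase(image: object,
                          question: str,
                          vlm: Callable[[object, str], str],
                          n_candidates: int = 3) -> str:
    """Ground the question in salient image details, generate rewrites,
    and keep the candidate the model itself rates most answerable."""
    details = vlm(image, "Describe the details most relevant to: " + question)
    candidates: List[str] = [question] + [
        vlm(image, f"Rewrite '{question}' using these details: {details} "
                   f"(variant {i})")
        for i in range(n_candidates - 1)
    ]
    # Score each candidate by asking the model for a confidence value.
    def confidence(q: str) -> float:
        reply = vlm(image, f"On a scale 0-1, how answerable is: {q}")
        try:
            return float(reply)
        except ValueError:
            return 0.0
    return max(candidates, key=confidence)
```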
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose AVIS, an autonomous information-seeking visual question answering framework.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
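A minimal sketch of an LLM agent that dynamically chooses tools, in the spirit of the summary above; the control-token protocol and tool registry are assumptions, not AVIS's interface:

```python
from typing import Callable, Dict

def avis_style_loop(question: str,
                    llm: Callable[[str], str],
                    tools: Dict[str, Callable[[str], str]],
                    max_steps: int = 5) -> str:
    """An LLM agent repeatedly picks an external tool (web search, OCR,
    captioning, ...) until it is ready to answer the question."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        choice = llm(f"{transcript}\nTools: {sorted(tools)}\n"
                     "Reply 'TOOL <name> <query>' or 'ANSWER <text>'.")
        if choice.startswith("ANSWER"):
            return choice[len("ANSWER"):].strip()
        parts = choice.split(" ", 2)
        if len(parts) < 3:          # malformed action; let the agent retry
            transcript += f"\nUnparseable action: {choice}"
            continue
        _, name, query = parts
        observation = tools.get(name, lambda q: "unknown tool")(query)
        transcript += f"\n{name}({query}) -> {observation}"
    return llm(f"{transcript}\nGive your best answer now.")
```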
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
- Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA.
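A small sketch of prompting with answer heuristics, assuming (as the summary suggests) that a conventional VQA model supplies candidate answers with confidences; the prompt wording is illustrative:

```python
from typing import Callable, List, Tuple

def prophet_style_prompt(question: str,
                         caption: str,
                         candidates: List[Tuple[str, float]],
                         llm: Callable[[str], str]) -> str:
    """Prompt an LLM with answer heuristics: candidate answers (and
    confidences) produced by a conventional VQA model."""
    hints = ", ".join(f"{a} ({p:.2f})" for a, p in candidates)
    prompt = (f"Context: {caption}\n"
              f"Question: {question}\n"
              f"Candidate answers with confidences: {hints}\n"
              "Pick or improve on the best answer:")
    return llm(prompt)
```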
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs from both the Visual Genome scene graph and an external knowledge base using controlled programs, as illustrated below.
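As an illustration of program-controlled QA generation that routes from a scene-graph triplet into a knowledge-base attribute (the templates and data shapes are assumptions, not the dataset's actual programs):

```python
from typing import Dict, List, Tuple

def generate_qa(scene_graph: List[Tuple[str, str, str]],
                kb: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Compose question-answer pairs by routing from a visual triplet
    (subject, relation, object) into an external knowledge attribute."""
    qa_pairs = []
    for subj, rel, obj in scene_graph:
        for attr, value in kb.get(obj, {}).items():
            # e.g. ("man", "holds", "umbrella") + {"umbrella": {"used for": "rain"}}
            q = f"What is the {attr} of the {obj} that the {subj} {rel}?"
            qa_pairs.append((q, value))
    return qa_pairs
```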
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve new state-of-the-art performance on three popular benchmark datasets: FVQA, Visual7W-KB, and OK-VQA.
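A minimal PyTorch sketch of one such read-update step over a graph's node features; the attention-based read and GRU-based memory update are plausible stand-ins, not the paper's exact module:

```python
import torch
import torch.nn as nn

class GRUCStep(nn.Module):
    """One memory-based reasoning step in the spirit of graph
    read / update / control: attend over a graph view's node
    features, then update a recurrent memory vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.read_att = nn.Linear(2 * dim, 1)   # attention over nodes
        self.update = nn.GRUCell(dim, dim)      # memory update

    def forward(self, memory: torch.Tensor,
                nodes: torch.Tensor) -> torch.Tensor:
        # memory: (B, D); nodes: (B, N, D) node features of one graph view
        query = memory.unsqueeze(1).expand_as(nodes)
        scores = self.read_att(torch.cat([nodes, query], dim=-1))
        weights = torch.softmax(scores, dim=1)   # (B, N, 1)
        read = (weights * nodes).sum(dim=1)      # graph read
        return self.update(read, memory)         # graph (memory) update
```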
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.