Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering
- URL: http://arxiv.org/abs/2501.12697v1
- Date: Wed, 22 Jan 2025 08:14:11 GMT
- Title: Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering
- Authors: Qian Tao, Xiaoyang Fan, Yong Xu, Xingquan Zhu, Yufei Tang
- Abstract summary: Zero-shot visual question answering (ZS-VQA) aims to answer visual questions without any training samples.
Existing ZS-VQA research has proposed leveraging knowledge graphs or large language models (LLMs) as external information sources.
We propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering.
- Score: 20.16172308719101
- Abstract: Zero-shot visual question answering (ZS-VQA), an emerging and critical research area, aims to answer visual questions without any training samples. Existing ZS-VQA research has proposed leveraging knowledge graphs or large language models (LLMs), respectively, as external information sources to help the VQA model comprehend images and questions. However, LLMs often struggle to accurately interpret the specific meaning of a question. Meanwhile, although knowledge graphs encode rich entity relationships, it is challenging to effectively connect their entities to the content of an individual image for visual question answering. In this paper, we propose a novel design that combines knowledge graphs and LLMs for zero-shot visual question answering. Our approach uses the LLM's powerful understanding capability to accurately interpret image content through a strategic question search mechanism. Meanwhile, the knowledge graph is used to expand and connect the user's query to the image content for better visual question answering. An optimization algorithm is further used to determine the optimal weights for the loss functions derived from the different information sources, yielding a globally optimal set of candidate answers. Experimental results on two benchmark datasets demonstrate that our model achieves state-of-the-art (SOTA) performance. Both the source code and the benchmark data will be released for public access.
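The abstract leaves the fusion details unspecified; as a rough illustration of the core idea (weighting candidate-answer evidence from the knowledge graph and the LLM, then searching for the best weight on validation data), here is a minimal sketch. The linear fusion, the grid search, and all names are assumptions, not the authors' algorithm:

```python
from typing import Dict, List

def fuse_scores(kg_scores: Dict[str, float],
                llm_scores: Dict[str, float],
                w_kg: float) -> Dict[str, float]:
    """Linearly combine per-candidate scores from the two sources."""
    candidates = set(kg_scores) | set(llm_scores)
    return {c: w_kg * kg_scores.get(c, 0.0)
               + (1.0 - w_kg) * llm_scores.get(c, 0.0)
            for c in candidates}

def search_weight(val_examples: List[dict], grid: int = 21) -> float:
    """Grid-search the fusion weight that maximizes validation accuracy.
    Each example: {"kg": {...}, "llm": {...}, "answer": str} (assumed shape)."""
    best_w, best_acc = 0.5, -1.0
    for step in range(grid):
        w = step / (grid - 1)
        hits = 0
        for ex in val_examples:
            fused = fuse_scores(ex["kg"], ex["llm"], w)
            pred = max(fused, key=fused.get)
            hits += int(pred == ex["answer"])
        acc = hits / len(val_examples)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```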
Related papers
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce Amar, an Adaptive Multi-Aspect Retrieval-augmented framework over knowledge graphs (KGs).
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
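As a hedged sketch of the summarized mechanism (turning retrieved KG text into prompt embeddings), one could pool token embeddings of each retrieved snippet into a soft-prompt vector. The module below is illustrative only, not the Amar implementation:

```python
import torch
import torch.nn as nn

class PromptEmbedder(nn.Module):
    """Pool retrieved KG snippets (entity, relation, and subgraph text)
    into soft-prompt vectors to prepend to the question embedding."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, snippet_ids: torch.Tensor) -> torch.Tensor:
        # snippet_ids: (num_snippets, seq_len) token ids of retrieved text
        tokens = self.tok_emb(snippet_ids)   # (S, L, D)
        pooled = tokens.mean(dim=1)          # one vector per snippet
        return self.proj(pooled)             # (S, D) soft prompts
```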
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- Right this way: Can VLMs Guide Us to See More to Answer Questions? [11.693356269848517]
In question-answering scenarios, humans assess whether the available information is sufficient and seek additional information if necessary.
In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information.
This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
arXiv Detail & Related papers (2024-11-01T06:43:54Z)
- Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA [19.6585442152102]
We study the knowledge-based visual question answering problem, in which, given a question, the model must ground it in the visual modality to find the answer.
Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image.
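A minimal sketch of this decomposition idea, with all callables (`decompose`, `answer_visual`, `aggregate`) as hypothetical stand-ins for the paper's components:

```python
from typing import Callable, List

def decompose_and_answer(question: str,
                         decompose: Callable[[str], List[str]],
                         answer_visual: Callable[[str], str],
                         aggregate: Callable[[str, List[str]], str]) -> str:
    """Answer a complex knowledge-based question by first splitting it
    into simpler sub-questions that are easier to ground in the image."""
    sub_questions = decompose(question)           # e.g. via an LLM prompt
    sub_answers = [answer_visual(q) for q in sub_questions]
    # Fuse the sub-answers back into a final answer to the original question.
    return aggregate(question, sub_answers)
```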
arXiv Detail & Related papers (2024-06-27T02:19:38Z)
- QAGCF: Graph Collaborative Filtering for Q&A Recommendation [58.21387109664593]
Question and answer (Q&A) platforms usually recommend question-answer pairs to meet users' knowledge acquisition needs.
Recommending pairs rather than single items makes user behavior more complex and presents two challenges for Q&A recommendation.
We introduce Question & Answer Graph Collaborative Filtering (QAGCF), a graph neural network model that creates separate graphs for collaborative and semantic views.
arXiv Detail & Related papers (2024-06-07T10:52:37Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
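As an illustration of such a question-driven prompting strategy (not QVix's actual prompts or code), the flow might look like this, where `llm` and `lvlm` are hypothetical callables:

```python
from typing import Callable

def qvix_style_answer(image: object,
                      question: str,
                      llm: Callable[[str], str],
                      lvlm: Callable[[object, str], str],
                      n_probe: int = 3) -> str:
    """Ask auxiliary exploratory questions about the image first,
    then answer the target question with those clues as context."""
    probe_prompt = (f"For the task question '{question}', list {n_probe} "
                    "short questions whose answers would help, one per line.")
    probes = [p for p in llm(probe_prompt).splitlines() if p][:n_probe]
    # Query the vision-language model for each probe question.
    clues = [f"Q: {p} A: {lvlm(image, p)}" for p in probes]
    final_prompt = "\n".join(clues) + f"\nNow answer: {question}"
    return lvlm(image, final_prompt)
```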
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and retain extensive world knowledge.
Because images are invisible to LLMs, researchers convert images to text to involve LLMs in the visual question reasoning process.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
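A hedged sketch of this ask-before-answering loop; the stopping token, prompt wording, and `vqa_model` helper are all illustrative assumptions:

```python
from typing import Callable

def ask_until_sufficient(caption: str,
                         question: str,
                         llm: Callable[[str], str],
                         vqa_model: Callable[[str], str],
                         max_rounds: int = 3) -> str:
    """Let the LLM request missing image details before committing
    to an answer, instead of reasoning over the caption alone."""
    context = caption
    for _ in range(max_rounds):
        probe = llm(f"Context: {context}\nTask: {question}\n"
                    "If the context is sufficient reply DONE, "
                    "otherwise ask one question about the image.")
        if probe.strip() == "DONE":
            break
        # A visual expert (here, a VQA model) answers the probe question.
        context += f"\n{probe} -> {vqa_model(probe)}"
    return llm(f"Context: {context}\nAnswer the question: {question}")
```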
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
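As a rough, gradient-free sketch of rephrase-then-select (not the authors' code; the confidence-scoring prompt is an assumption):

```python
from typing import Callable, List

def repare_style_rephrase(image: object,
                          question: str,
                          vlm: Callable[[object, str], str],
                          n_candidates: int = 3) -> str:
    """Ground the question in salient image details, generate rewrites,
    and keep the candidate the model itself rates most answerable."""
    details = vlm(image, "Describe the details most relevant to: " + question)
    candidates: List[str] = [question] + [
        vlm(image, f"Rewrite '{question}' using these details: {details} "
                   f"(variant {i})")
        for i in range(n_candidates - 1)
    ]
    # Score each candidate by asking the model for a confidence value.
    def confidence(q: str) -> float:
        reply = vlm(image, f"On a scale 0-1, how answerable is: {q}")
        try:
            return float(reply)
        except ValueError:
            return 0.0
    return max(candidates, key=confidence)
```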
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose AVIS, an autonomous information-seeking visual question answering framework.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
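A minimal sketch of an LLM agent that dynamically chooses tools, in the spirit of the summary above; the control-token protocol and tool registry are assumptions, not AVIS's interface:

```python
from typing import Callable, Dict

def avis_style_loop(question: str,
                    llm: Callable[[str], str],
                    tools: Dict[str, Callable[[str], str]],
                    max_steps: int = 5) -> str:
    """An LLM agent repeatedly picks an external tool (web search, OCR,
    captioning, ...) until it is ready to answer the question."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        choice = llm(f"{transcript}\nTools: {sorted(tools)}\n"
                     "Reply 'TOOL <name> <query>' or 'ANSWER <text>'.")
        if choice.startswith("ANSWER"):
            return choice[len("ANSWER"):].strip()
        parts = choice.split(" ", 2)
        if len(parts) < 3:          # malformed action; let the agent retry
            transcript += f"\nUnparseable action: {choice}"
            continue
        _, name, query = parts
        observation = tools.get(name, lambda q: "unknown tool")(query)
        transcript += f"\n{name}({query}) -> {observation}"
    return llm(f"{transcript}\nGive your best answer now.")
```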
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
- Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering.
We present a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA.
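A small sketch of prompting with answer heuristics, assuming (as the summary suggests) that a conventional VQA model supplies candidate answers with confidences; the prompt wording is illustrative:

```python
from typing import Callable, List, Tuple

def prophet_style_prompt(question: str,
                         caption: str,
                         candidates: List[Tuple[str, float]],
                         llm: Callable[[str], str]) -> str:
    """Prompt an LLM with answer heuristics: candidate answers (and
    confidences) produced by a conventional VQA model."""
    hints = ", ".join(f"{a} ({p:.2f})" for a, p in candidates)
    prompt = (f"Context: {caption}\n"
              f"Question: {question}\n"
              f"Candidate answers with confidences: {hints}\n"
              "Pick or improve on the best answer:")
    return llm(prompt)
```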
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate question-answer pairs from both the Visual Genome scene graph and an external knowledge base using controlled programs, as illustrated below.
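As an illustration of program-controlled QA generation that routes from a scene-graph triplet into a knowledge-base attribute (the templates and data shapes are assumptions, not the dataset's actual programs):

```python
from typing import Dict, List, Tuple

def generate_qa(scene_graph: List[Tuple[str, str, str]],
                kb: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Compose question-answer pairs by routing from a visual triplet
    (subject, relation, object) into an external knowledge attribute."""
    qa_pairs = []
    for subj, rel, obj in scene_graph:
        for attr, value in kb.get(obj, {}).items():
            # e.g. ("man", "holds", "umbrella") + {"umbrella": {"used for": "rain"}}
            q = f"What is the {attr} of the {obj} that the {subj} {rel}?"
            qa_pairs.append((q, value))
    return qa_pairs
```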
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve new state-of-the-art performance on three popular benchmark datasets: FVQA, Visual7W-KB, and OK-VQA.
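A minimal PyTorch sketch of one such read-update step over a graph's node features; the attention-based read and GRU-based memory update are plausible stand-ins, not the paper's exact module:

```python
import torch
import torch.nn as nn

class GRUCStep(nn.Module):
    """One memory-based reasoning step in the spirit of graph
    read / update / control: attend over a graph view's node
    features, then update a recurrent memory vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.read_att = nn.Linear(2 * dim, 1)   # attention over nodes
        self.update = nn.GRUCell(dim, dim)      # memory update

    def forward(self, memory: torch.Tensor,
                nodes: torch.Tensor) -> torch.Tensor:
        # memory: (B, D); nodes: (B, N, D) node features of one graph view
        query = memory.unsqueeze(1).expand_as(nodes)
        scores = self.read_att(torch.cat([nodes, query], dim=-1))
        weights = torch.softmax(scores, dim=1)   # (B, N, 1)
        read = (weights * nodes).sum(dim=1)      # graph read
        return self.update(read, memory)         # graph (memory) update
```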
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.