KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
- URL: http://arxiv.org/abs/2601.11632v1
- Date: Wed, 14 Jan 2026 07:16:11 GMT
- Title: KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
- Authors: Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie
- Abstract summary: KG-ViP is a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs.
- Score: 18.921632630913713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs offer precisely complementary solutions to these respective deficiencies: the former provides rich external knowledge, while the latter captures fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on the FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.
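The abstract describes the retrieval-and-fusion pipeline only at a high level. The sketch below is a minimal, hypothetical illustration of the general idea of using the query as a semantic bridge over a scene graph and a commonsense graph; all names (Triple, retrieve, build_context) and the lexical-overlap retriever are assumptions for illustration, not the authors' implementation, which would presumably use learned retrieval and an MLLM for answer generation.

```python
# Minimal, hypothetical sketch of a query-bridged retrieval-and-fusion step.
# Names and the lexical-overlap retriever are illustrative assumptions, not
# the KG-ViP authors' implementation.
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

    def text(self) -> str:
        return f"{self.head} {self.relation} {self.tail}"


def terms(text: str) -> set:
    """Lower-cased word set, a crude stand-in for semantic matching."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(graph, query: str, k: int = 3):
    """Keep the k triples sharing the most words with the query (the 'bridge')."""
    q = terms(query)
    scored = [(len(q & terms(t.text())), t) for t in graph]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:k] if score > 0]


def build_context(scene_graph, commonsense_graph, query: str) -> str:
    """Fuse query-relevant triples from both graphs into one structured context.

    Scene-graph triples are retrieved first; their entities then expand the
    query before retrieving commonsense triples, so the graphs are integrated
    progressively rather than in isolation.
    """
    scene = retrieve(scene_graph, query)
    expanded = query + " " + " ".join(t.text() for t in scene)
    commonsense = retrieve(commonsense_graph, expanded)
    lines = ["[Scene facts]"] + [t.text() for t in scene]
    lines += ["[Commonsense facts]"] + [t.text() for t in commonsense]
    return "\n".join(lines)  # would be prepended to the MLLM prompt


if __name__ == "__main__":
    scene = [Triple("man", "holding", "umbrella"), Triple("sky", "is", "cloudy")]
    commonsense = [Triple("umbrella", "used for", "staying dry"),
                   Triple("cloudy sky", "causes", "rain")]
    print(build_context(scene, commonsense, "Why is the man holding an umbrella?"))
```

The design point mirrored here is the progressive integration: entities from the scene-graph triples retrieved for the query are used to expand the query before the commonsense graph is queried, so the two graphs are fused through the query rather than handled in isolation.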
Related papers
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
- MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation [17.382062394739588]
Large language models (LLMs) struggle with high-level conceptual understanding and holistic comprehension due to limited context windows. We introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process.
arXiv Detail & Related papers (2025-11-26T05:00:03Z)
- Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching [61.824094419641575]
Large Language Models (LLMs) struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable reasoning on vanilla KGs, but overlook this gap. We propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs, bridging the semantic gap between graphs and queries.
arXiv Detail & Related papers (2025-09-25T06:48:52Z)
- MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs [6.165053219836395]
We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
arXiv Detail & Related papers (2025-07-28T13:16:23Z)
- VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation [16.248703946640735]
Multimodal Knowledge Graphs (MMKGs) represent explicit knowledge across multiple modalities. The Visual-Audio-Text Knowledge Graph (VAT-KG) is the first concept-centric and knowledge-intensive multimodal knowledge graph.
arXiv Detail & Related papers (2025-06-11T07:22:57Z)
- VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [70.44416154144001]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
- Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering [28.48844388792774]
We present a novel modality-aware integration with large language models (LLMs) for KVQA, named MAIL.
MAIL carefully leverages multimodal knowledge for both image understanding and knowledge reasoning.
Experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.
arXiv Detail & Related papers (2024-02-20T05:32:24Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
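The VQA-GNN entry above describes inter-connecting the scene graph and the concept graph through a super node representing the QA context. A minimal sketch of that joint-graph construction is shown below; it uses networkx as an arbitrary graph library, and the node naming, edge attributes, and toy data are hypothetical rather than the authors' code, which additionally runs a graph neural network over the joint graph.

```python
# Illustrative sketch (not the VQA-GNN authors' code) of joining a scene graph
# and a concept graph through a single "QA-context" super node.
import networkx as nx


def build_joint_graph(scene_edges, concept_edges, question, answer_choices):
    g = nx.Graph()
    # Scene graph: objects and their visual relations from the image.
    for head, rel, tail in scene_edges:
        g.add_edge(f"scene:{head}", f"scene:{tail}", relation=rel)
    # Concept graph: commonsense relations between concepts.
    for head, rel, tail in concept_edges:
        g.add_edge(f"concept:{head}", f"concept:{tail}", relation=rel)
    # Super node representing the QA context, linked to every node so that
    # message passing can flow between the two otherwise-disjoint graphs.
    super_node = "qa_context"
    g.add_node(super_node, question=question, choices=tuple(answer_choices))
    for node in list(g.nodes):
        if node != super_node:
            g.add_edge(super_node, node, relation="context")
    return g


if __name__ == "__main__":
    scene = [("man", "holding", "umbrella"), ("umbrella", "above", "man")]
    concepts = [("umbrella", "used_for", "rain protection")]
    joint = build_joint_graph(scene, concepts, "Why hold an umbrella?", ["rain", "sun"])
    print(joint.number_of_nodes(), "nodes,", joint.number_of_edges(), "edges")
```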