MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
- URL: http://arxiv.org/abs/2507.20804v1
- Date: Mon, 28 Jul 2025 13:16:23 GMT
- Title: MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
- Authors: Xueyao Wan, Hang Yu
- Abstract summary: We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
- Score: 6.165053219836395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods overlook multimodal information. Multimodal RAG methods address this by fusing images and text by mapping them into a shared embedding space, but they fail to capture the structure of knowledge and the logical chains between modalities. Moreover, they require large-scale training for specific tasks, resulting in limited generalization ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with a text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
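The abstract's central mechanism, spectral clustering for cross-modal entity linking, is easy to picture in code. Below is a minimal sketch, assuming entities from the text-based KG and from image scene graphs already carry embedding vectors; the toy entities, embedding dimension, similarity floor, and cluster count are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of cross-modal entity linking via spectral clustering.
# All entities, dimensions, and parameters below are made-up stand-ins.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Toy embeddings for entities from the text KG and from image scene graphs
# (in practice these would come from real text and vision encoders).
text_entities = {"Eiffel Tower": rng.normal(size=64), "Paris": rng.normal(size=64)}
image_entities = {"tower_region_3": rng.normal(size=64), "sky_region_1": rng.normal(size=64)}

names = list(text_entities) + list(image_entities)
X = np.stack(list(text_entities.values()) + list(image_entities.values()))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize for cosine similarity

# Non-negative cosine affinity; the small floor keeps the similarity graph connected.
affinity = np.clip(X @ X.T, 1e-3, None)

# Entities sharing a cluster are linked across modalities and can be
# merged into a single node of the multimodal knowledge graph.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)

for cluster in sorted(set(labels)):
    print(f"cluster {cluster}: {[n for n, l in zip(names, labels) if l == cluster]}")
```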
Related papers
- VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph [42.348770377488094]
VimRAG is a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. We propose a Graph-Guided Policy Optimization strategy to disentangle step-wise validity from trajectory-level rewards. Experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.
arXiv Detail & Related papers (2026-02-13T09:05:09Z) - RAG-GFM: Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation [27.59455285600957]
Graph Foundation Models (GFMs) have emerged as a frontier in graph learning, which are expected to deliver transferable representations across diverse tasks. We propose RAG-GFM, a Retrieval-Augmented Generation aided Graph Foundation Model that offloads knowledge from parameters. We show that RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification.
arXiv Detail & Related papers (2026-01-21T16:02:43Z) - MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation [17.382062394739588]
Large language models (LLMs) struggle with high-level conceptual understanding and holistic comprehension due to limited context windows. We introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process.
arXiv Detail & Related papers (2025-11-26T05:00:03Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval [1.160208922584163]
We present a Modality-Aware Hybrid retrieval Architecture (MAHA) for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
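MAHA's combination of dense retrieval and graph traversal follows a common hybrid pattern; here is a minimal sketch of that pattern under stated assumptions (the toy chunks, random embeddings, and the `hybrid_retrieve` helper are illustrative, not MAHA's actual code).

```python
# Illustrative sketch: dense retrieval seeds the context, graph traversal expands it.
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
chunks = ["chart: Q3 revenue", "paragraph on revenue drivers", "table of expenses"]
chunk_vecs = rng.normal(size=(len(chunks), 32))
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

# Toy modality-aware knowledge graph: nodes are chunk indices, edges encode
# cross-modal relations such as a chart illustrating a paragraph.
kg = nx.Graph()
kg.add_edges_from([(0, 1, {"rel": "illustrates"}), (1, 2, {"rel": "references"})])

def hybrid_retrieve(query_vec, k=1, hops=1):
    """Pick top-k chunks by cosine similarity, then add graph neighbors."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    seeds = np.argsort(chunk_vecs @ query_vec)[-k:]
    context = set(int(s) for s in seeds)
    for seed in seeds:
        # Expand along KG edges to pull in structurally related chunks.
        context |= set(nx.single_source_shortest_path_length(kg, int(seed), cutoff=hops))
    return [chunks[i] for i in sorted(context)]

print(hybrid_retrieve(rng.normal(size=32)))
```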
arXiv Detail & Related papers (2025-10-16T11:55:24Z) - G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge [88.82814893945077]
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. G-reasoner is a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge.
arXiv Detail & Related papers (2025-09-29T04:38:12Z) - Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching [61.824094419641575]
Large Language Models (LLMs) struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable reasoning on vanilla KGs, overlooking this gap. We propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries.
arXiv Detail & Related papers (2025-09-25T06:48:52Z) - GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs). Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, and infused into the LLM together. Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
arXiv Detail & Related papers (2025-09-20T02:38:00Z) - DSRAG: A Domain-Specific Retrieval Framework Based on Document-derived Multimodal Knowledge Graph [4.951890767337337]
This work focuses on a graph-based RAG framework, emphasizing the critical role of knowledge graph quality during the generation process. We propose DSRAG, a multimodal knowledge graph-driven retrieval-augmented generation framework designed for domain-specific applications.
arXiv Detail & Related papers (2025-08-22T14:24:48Z) - Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z) - VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation [3.1033038923749774]
We propose the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information. Our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics. We introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities.
arXiv Detail & Related papers (2025-06-11T07:22:57Z) - MLaGA: Multimodal Large Language and Graph Assistant [9.985787670804823]
Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. We introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes.
arXiv Detail & Related papers (2025-06-03T07:52:00Z) - Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [75.9865035064794]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z) - Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses these limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types. A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verification.
arXiv Detail & Related papers (2025-05-20T06:44:34Z) - Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning [10.761218096540976]
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts. We propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing Multimodal Knowledge Graphs.
arXiv Detail & Related papers (2025-03-17T09:31:14Z) - Pseudo-Knowledge Graph: Meta-Path Guided Retrieval and In-Graph Text for RAG-Equipped LLM [8.941718961724984]
The Pseudo-Knowledge Graph (PKG) framework integrates Meta-path Retrieval, In-graph Text, and Vector Retrieval into Large Language Models. PKG offers a richer knowledge representation and improves accuracy in information retrieval.
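Meta-path retrieval itself is a standard operation that can be sketched briefly; the toy typed graph and the author-paper-venue meta-path below are assumptions chosen for demonstration, not PKG's schema.

```python
# Illustrative sketch of meta-path guided retrieval over a typed graph.
import networkx as nx

g = nx.DiGraph()
g.add_node("alice", ntype="author")
g.add_node("p1", ntype="paper")
g.add_node("neurips", ntype="venue")
g.add_edges_from([("alice", "p1"), ("p1", "neurips")])

def follow_metapath(graph, start, metapath):
    """Walk the graph, keeping only nodes whose types match the meta-path."""
    frontier = {start} if graph.nodes[start]["ntype"] == metapath[0] else set()
    for ntype in metapath[1:]:
        frontier = {succ for node in frontier for succ in graph.successors(node)
                    if graph.nodes[succ]["ntype"] == ntype}
    return frontier

# Retrieve venues reachable from an author via author -> paper -> venue.
print(follow_metapath(g, "alice", ["author", "paper", "venue"]))  # {'neurips'}
```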
arXiv Detail & Related papers (2025-03-01T02:39:37Z) - UniGraph2: Learning a Unified Embedding Space to Bind Multimodal Graphs [34.48393396390799]
We propose a novel cross-domain graph foundation model that enables general representation learning on multimodal graphs. UniGraph2 employs modality-specific encoders alongside a graph neural network (GNN) to learn a unified low-dimensional embedding space. We show that UniGraph2 significantly outperforms state-of-the-art models in tasks such as representation learning, transfer learning, and multimodal generative tasks.
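The encoder-plus-GNN design can be made concrete with a small sketch; the layer sizes, single mean-aggregation layer, and dense toy adjacency below are assumptions for illustration, not UniGraph2's published architecture.

```python
# Illustrative sketch: modality-specific encoders feeding one GNN layer
# to produce a unified embedding space over a multimodal graph.
import torch
import torch.nn as nn

class MultimodalGraphEncoder(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden=128):
        super().__init__()
        # Modality-specific encoders project each node type into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.gnn = nn.Linear(2 * hidden, hidden)  # one mean-aggregation layer

    def forward(self, text_x, image_x, adj):
        # Stack text and image nodes into one feature matrix.
        h = torch.cat([self.text_proj(text_x), self.image_proj(image_x)], dim=0)
        neigh = adj @ h / adj.sum(dim=1, keepdim=True).clamp(min=1)  # neighbor mean
        return self.gnn(torch.cat([h, neigh], dim=1))  # unified embeddings

# Toy graph with 2 text nodes and 2 image nodes, fully connected for brevity.
enc = MultimodalGraphEncoder()
adj = torch.ones(4, 4) - torch.eye(4)
print(enc(torch.randn(2, 768), torch.randn(2, 512), adj).shape)  # torch.Size([4, 128])
```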
arXiv Detail & Related papers (2025-02-02T14:04:53Z) - All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks [51.19110891434727]
Large Language Models (LLMs), with pretrained knowledge and powerful semantic comprehension abilities, have recently shown remarkable potential to benefit applications using vision and text data.
E-LLaGNN is a framework with an on-demand LLM service that enriches the message-passing procedure of graph learning by enhancing a limited fraction of nodes in the graph.
arXiv Detail & Related papers (2024-07-20T22:09:42Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation [51.80447197290866]
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given knowledge graphs. Existing MMKGC methods usually extract multi-modal features with pre-trained models. We introduce a novel framework, MyGO, to tokenize, fuse, and augment the fine-grained multi-modal representations of entities.
arXiv Detail & Related papers (2024-04-15T05:40:41Z) - Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first generate a scene graph (SG) using the Large Language Model (LLM) and then use that SG in the prompt to produce a response. We find that the proposed CCoT approach improves the performance of several popular LMMs on general multimodal benchmarks.
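The two-stage prompting pattern is simple enough to sketch; the `call_llm` stub below stands in for whatever model API is used, and both the prompt wording and the use of a textual image description (in place of the image itself) are assumptions for illustration.

```python
# Illustrative sketch of the two-stage CCoT prompting pattern.
def call_llm(prompt: str) -> str:
    """Placeholder for a real LMM/LLM call; wire this to your model of choice."""
    raise NotImplementedError

def ccot_answer(image_description: str, question: str) -> str:
    # Stage 1: ask the model to produce a scene graph.
    scene_graph = call_llm(
        f"For the image described as: {image_description}\n"
        "Generate a scene graph in JSON listing objects, their attributes, "
        "and the relationships between them."
    )
    # Stage 2: condition the final answer on the generated scene graph.
    return call_llm(
        f"Scene graph: {scene_graph}\n"
        f"Using the scene graph as context, answer: {question}"
    )
```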
arXiv Detail & Related papers (2023-11-27T22:23:27Z) - Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)