Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
- URL: http://arxiv.org/abs/2510.14592v1
- Date: Thu, 16 Oct 2025 11:55:24 GMT
- Title: Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
- Authors: Rashmi R, Vidyadhar Upadhya
- Abstract summary: We present a Modality-Aware Hybrid retrieval Architecture (MAHA) for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
- Score: 1.160208922584163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486 while providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
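The abstract describes combining dense vector retrieval with traversal of a modality-aware knowledge graph, but gives no implementation details. The following is only a minimal sketch of that general idea, not the authors' method: the toy nodes, embeddings, edges, the `decay` factor, and all function names are hypothetical and chosen purely for illustration.

```python
import math
from collections import deque

# Hypothetical toy graph: each node is a document chunk with a modality tag
# and a tiny hand-written embedding (a real system would use learned vectors).
NODES = {
    "t1": {"modality": "text", "vec": [1.0, 0.0, 0.0]},
    "img1": {"modality": "image", "vec": [0.0, 1.0, 0.0]},
    "tab1": {"modality": "table", "vec": [0.0, 0.0, 1.0]},
    "t2": {"modality": "text", "vec": [0.6, 0.8, 0.0]},
}

# Hypothetical cross-modal edges, e.g. a paragraph that refers to a figure and a table.
EDGES = {"t1": ["img1", "tab1"], "img1": ["t1"], "tab1": ["t1"], "t2": []}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_top_k(query_vec, k):
    """Dense step: rank all nodes by cosine similarity and keep the top k."""
    scored = [(cosine(query_vec, node["vec"]), nid) for nid, node in NODES.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(nid, score) for score, nid in scored[:k]]


def expand(seeds, hops, decay):
    """Graph step: breadth-first expansion from the seeds.

    A neighbour inherits the score of the node it was reached from,
    multiplied by `decay` (an assumed heuristic) for each hop.
    """
    best = dict(seeds)
    queue = deque((nid, score, 0) for nid, score in seeds)
    while queue:
        nid, score, depth = queue.popleft()
        if depth == hops:
            continue
        for neighbour in EDGES.get(nid, []):
            candidate = score * decay
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                queue.append((neighbour, candidate, depth + 1))
    return best


def hybrid_retrieve(query_vec, k=1, hops=1, decay=0.5):
    """Dense seeds plus graph neighbours, ranked by combined score."""
    scores = expand(dense_top_k(query_vec, k), hops, decay)
    return sorted(scores.items(), key=lambda s: (-s[1], s[0]))


results = hybrid_retrieve([1.0, 0.0, 0.0])
covered = {NODES[nid]["modality"] for nid, _ in results}
print(results)
print(sorted(covered))
```

In this toy setup the dense step alone returns only the text chunk `t1`; the graph step then pulls in the linked image and table chunks, which is one way a hybrid retriever could broaden modality coverage beyond what embedding similarity surfaces on its own.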
Related papers
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- RAG-Anything: All-in-One RAG Framework [10.858282833070726]
We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types.
arXiv Detail & Related papers (2025-10-14T09:25:35Z)
- CMRAG: Co-modality-based visual document retrieval and question answering [21.016544020685668]
The Co-Modality-based RAG (CMRAG) framework can leverage texts and images for more accurate retrieval and generation. Our framework consistently outperforms single-modality-based RAG in multiple visual document question-answering (VDQA) benchmarks.
arXiv Detail & Related papers (2025-09-02T09:17:57Z)
- MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs [6.165053219836395]
We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
arXiv Detail & Related papers (2025-07-28T13:16:23Z)
- A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z)
- HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation [11.53083922927901]
HM-RAG is a novel Hierarchical Multi-agent Multimodal RAG framework. It pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data.
arXiv Detail & Related papers (2025-04-13T06:55:33Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs). The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We introduce MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation [51.80447197290866]
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given knowledge graphs. Existing MMKGC methods usually extract multi-modal features with pre-trained models. We introduce a novel framework MyGO to tokenize, fuse, and augment the fine-grained multi-modal representations of entities.
arXiv Detail & Related papers (2024-04-15T05:40:41Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities. We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.