MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval
- URL: http://arxiv.org/abs/2508.00579v1
- Date: Fri, 01 Aug 2025 12:22:53 GMT
- Title: MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval
- Authors: Ziyu Gong, Yihua Huang, Chengcheng Mai
- Abstract summary: The task aims to locate and integrate multi-modal evidence distributed across multiple pages for question understanding and answer generation. A novel multi-modal RAG model, named MMRAG-DocQA, is proposed, leveraging both textual and visual information across long-range pages. A multi-granularity semantic retrieval method is proposed, combining joint similarity evaluation with large language model (LLM)-based re-ranking.
- Score: 4.400088031376775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence (such as texts, tables, charts, images, and layouts) distributed across multiple pages for question understanding and answer generation. Existing methods fall into two categories: Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former are susceptible to hallucinations, while the latter struggle with inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MMRAG-DocQA, is proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering. A hierarchical indexing method that integrates flattened in-page chunks with topological cross-page chunks is designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, comprising page-level parent-page retrieval and document-level summary retrieval, is proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results on the public MMLongBench-Doc and LongDocURL datasets demonstrate the superiority of MMRAG-DocQA in understanding and answering modality-rich, multi-page documents.
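To make the hierarchical index concrete, below is a minimal Python sketch of the data structure the abstract describes: flattened in-page chunks for multi-modal content within a page, topological links for cross-page dependencies, and page- and document-level summaries serving as retrieval parents. All names (`Chunk`, `HierarchicalIndex`, the field layout) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the hierarchical index described in the abstract.
# Flattened in-page chunks hold multi-modal content within a page;
# topological cross-page links record long-distance dependencies.
# All names here are illustrative, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    page: int                  # source page number
    modality: str              # "text" | "table" | "chart" | "image"
    content: str               # raw text, or a caption/summary for visual elements
    embedding: list[float] | None = None

@dataclass
class HierarchicalIndex:
    # Page number -> flattened multi-modal chunks on that page.
    in_page: dict[int, list[Chunk]] = field(default_factory=dict)
    # chunk_id -> ids of chunks it depends on across pages (topological links).
    cross_page: dict[str, set[str]] = field(default_factory=dict)
    # Page-level parent summaries and one document-level summary.
    page_summaries: dict[int, str] = field(default_factory=dict)
    doc_summary: str = ""

    def add_chunk(self, chunk: Chunk) -> None:
        self.in_page.setdefault(chunk.page, []).append(chunk)

    def link(self, src_id: str, dst_id: str) -> None:
        # Register a cross-page dependency in both directions.
        self.cross_page.setdefault(src_id, set()).add(dst_id)
        self.cross_page.setdefault(dst_id, set()).add(src_id)
```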
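Building on that sketch, the multi-granularity retrieval stage could plausibly proceed as follows: score in-page chunks by embedding similarity, expand hits along cross-page links, add page-level parent summaries and the document-level summary, and let an LLM re-rank the merged evidence. The `embed` and `llm_rerank` callables are assumed, externally supplied components, not APIs from the paper.

```python
import numpy as np

def cosine(u: list[float], v: list[float]) -> float:
    a, b = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(index: "HierarchicalIndex", question: str,
             embed, llm_rerank, k: int = 5) -> list[str]:
    """Multi-granularity retrieval sketch: chunk similarity -> cross-page
    expansion -> page-level parents + document summary -> LLM re-ranking.
    Assumes embed(text) -> list[float] and
    llm_rerank(question, texts) -> list[str] are supplied by the caller."""
    q = embed(question)

    # 1) Joint similarity evaluation over all flattened in-page chunks.
    all_chunks = [c for chunks in index.in_page.values() for c in chunks]
    by_id = {c.chunk_id: c for c in all_chunks}
    scored = sorted(all_chunks,
                    key=lambda c: cosine(q, c.embedding or embed(c.content)),
                    reverse=True)
    hits = scored[:k]

    # 2) Expand hits along topological cross-page links.
    expanded = {c.chunk_id for c in hits}
    for c in hits:
        expanded |= index.cross_page.get(c.chunk_id, set())

    # 3) Add page-level parent summaries and the document-level summary.
    candidates = [by_id[cid].content for cid in expanded if cid in by_id]
    candidates += [index.page_summaries[c.page] for c in hits
                   if c.page in index.page_summaries]
    if index.doc_summary:
        candidates.append(index.doc_summary)

    # 4) LLM-based re-ranking of the merged candidate evidence.
    return llm_rerank(question, candidates)[:k]
```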
Related papers
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset for evaluating multimodal understanding and complex reasoning across structurally diverse, multi-source ESG documents. It comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding [0.0]
Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering. Traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches.
arXiv Detail & Related papers (2025-06-19T05:11:43Z) - A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. The framework demonstrates robust performance without task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z) - A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization [18.13855430873805]
Current methods apply truncation after the retrieval process to fit the context length. We propose a novel retrieval-based framework that integrates query selection and document ranking. We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics.
arXiv Detail & Related papers (2025-04-23T13:41:10Z) - M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents, since existing document understanding benchmarks often assess LVLMs only through question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), comprising 500 high-quality arXiv papers along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.30364248231053]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs). To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels.
arXiv Detail & Related papers (2025-01-15T14:30:13Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z) - PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z) - Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation [33.56304858796142]
Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidence separately and then use a language model to generate an answer based on the retrieved evidence.
We propose a Structured Knowledge and Unified Retrieval-Generation (SKURG) approach to address these issues.
arXiv Detail & Related papers (2022-12-16T18:12:04Z)