Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
- URL: http://arxiv.org/abs/2506.16035v2
- Date: Sun, 13 Jul 2025 19:52:49 GMT
- Title: Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
- Authors: Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed
- Abstract summary: Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering. Traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.
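The batching loop described in the abstract (configurable page batches with cross-batch context preservation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `lmm_chunk_pages` is a hypothetical placeholder for the LMM call, and the batch size, rendering DPI, and context-summary handoff are assumptions.

```python
# Minimal sketch of vision-guided chunking: render PDF pages to images,
# send them to a Large Multimodal Model in fixed-size batches, and carry a
# context summary across batches so multi-page tables and procedures stay
# intact. `lmm_chunk_pages` is a hypothetical placeholder, not a real API.
from dataclasses import dataclass
from typing import List, Tuple

from pdf2image import convert_from_path  # renders PDF pages as PIL images


@dataclass
class Chunk:
    text: str         # semantically coherent chunk text produced by the LMM
    pages: List[int]  # page numbers the chunk spans


def lmm_chunk_pages(images, context_summary: str) -> Tuple[List[Chunk], str]:
    """Placeholder for the LMM call: given a batch of page images and the
    summary carried over from the previous batch (e.g. an unfinished table),
    return the chunks for this batch plus an updated summary."""
    raise NotImplementedError


def vision_guided_chunking(pdf_path: str, batch_size: int = 4) -> List[Chunk]:
    pages = convert_from_path(pdf_path, dpi=200)
    chunks: List[Chunk] = []
    context_summary = ""  # cross-batch context preservation
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        batch_chunks, context_summary = lmm_chunk_pages(batch, context_summary)
        chunks.extend(batch_chunks)
    return chunks
```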
Related papers
- MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval [4.400088031376775]
The aim is to locate and integrate multi-modal evidence distributed across multiple pages for question understanding and answer generation. A novel multi-modal RAG model, named MMRAG-DocQA, was proposed, leveraging both textual and visual information across long-range pages. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method was proposed.
arXiv Detail & Related papers (2025-08-01T12:22:53Z)
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
- Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels.
arXiv Detail & Related papers (2025-01-15T14:30:13Z)
- Advanced ingestion process powered by LLM parsing for RAG system [0.0]
This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse document types. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata.
arXiv Detail & Related papers (2024-12-16T20:33:33Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents as richly structured.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.