Related papers: MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

URL: http://arxiv.org/abs/2509.07666v1
Date: Sat, 06 Sep 2025 00:59:28 GMT
Title: MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval
Authors: Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Abstract summary: MoLoRAG is a logic-aware retrieval framework for multi-modal, multi-page document understanding. It combines semantic and logical relevance to deliver more accurate retrieval. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy.
Score: 17.50612953979537
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which are essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Code and datasets are released at https://github.com/WxxShirley/MoLoRAG.
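
The retrieval idea in the abstract (a page graph, a traversal guided by a relevance check, and a final top-K selection that mixes semantic and logical relevance) can be illustrated with a small Python sketch. This is not the authors' implementation: the function name `retrieve_pages`, the breadth-first traversal, the threshold, the hop limit, and the plain average of the two scores are all assumptions made for illustration, and `logical_score` is a stand-in lookup for what would be a VLM judgment in the paper.

```python
from collections import deque

def retrieve_pages(graph, semantic, logical_score, top_k=2,
                   n_seeds=1, threshold=0.5, max_hops=2):
    """Hypothetical logic-aware retrieval over a page graph.

    graph:         dict mapping page id -> list of neighbouring page ids
    semantic:      dict mapping page id -> semantic similarity in [0, 1]
    logical_score: callable(page) -> logical relevance in [0, 1]
                   (assumed stand-in for a VLM relevance check)
    """
    # Start from the pages with the highest semantic similarity.
    seeds = sorted(semantic, key=semantic.get, reverse=True)[:n_seeds]
    visited = set(seeds)
    queue = deque((page, 0) for page in seeds)
    scores = {}

    while queue:
        page, hop = queue.popleft()
        logic = logical_score(page)
        # Assumption: combine the two signals with a simple average.
        scores[page] = (semantic[page] + logic) / 2
        # Expand only from pages judged logically relevant.
        if hop < max_hops and logic >= threshold:
            for neighbour in graph.get(page, []):
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append((neighbour, hop + 1))

    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: page 3 looks similar to the query but is not logically
# relevant, while page 2 has low similarity but high logical relevance.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
semantic = {1: 0.9, 2: 0.2, 3: 0.6, 4: 0.05}
logical = {1: 0.8, 2: 0.9, 3: 0.1, 4: 0.0}

semantic_only = sorted(semantic, key=semantic.get, reverse=True)[:2]
result = retrieve_pages(graph, semantic, logical.get)
print(semantic_only, result)
```

On this toy input, ranking by semantic similarity alone selects pages 1 and 3, whereas the traversal reaches page 2 through its edge from page 1 and ranks it above page 3 once logical relevance is averaged in.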

Related papers

Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs [64.27273946787344]
Recent Long-Context Language Models can process hundreds of thousands of tokens in a single prompt. We recast reasoning as reusable thought caches, derived from prior problem solving traces. We propose an update strategy that iteratively refines templates derived from training data through natural-language feedback.
arXiv Detail & Related papers (2025-10-08T19:52:35Z)
LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding [37.12229829548839]
We propose LAD-RAG, a novel Layout-aware Dynamic RAG framework. LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning.
arXiv Detail & Related papers (2025-10-08T17:02:04Z)
MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval [4.400088031376775]
The aim is to locate and integrate multi-modal evidence distributed across multiple pages, for question understanding and answer generation. A novel multi-modal RAG model, named MMRAG-DocQA, was proposed, leveraging both textual and visual information across long-range pages. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method was proposed.
arXiv Detail & Related papers (2025-08-01T12:22:53Z)
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement [17.272061289197342]
Document Visual Question Answering (DocVQA) is a practical yet challenging task. Recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline. We introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA.
arXiv Detail & Related papers (2025-06-16T22:15:58Z)
Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning [12.17399365931]
Existing one-pass MLLMs process entire document images without considering query relevance. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB, a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLM. Our method allows the model to autonomously select the set of regions most relevant to the query, and then focus attention on them for further understanding.
arXiv Detail & Related papers (2025-05-24T08:53:05Z)
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench). M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search [65.53881294642451]
Deliberate Thinking based Dense Retriever (DEBATER). DEBATER enhances recent dense retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts. M3DocRAG can efficiently handle single or many documents while preserving visual information. We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning [0.0]
Existing document understanding models tend to generate answers with a single word or phrase directly. We use Multi-modal Large Language Models (MLLMs) to generate step-wise question-and-answer pairs for document images. We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant.
arXiv Detail & Related papers (2024-02-26T01:17:50Z)
Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs). query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets. Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.