SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
- URL: http://arxiv.org/abs/2506.14035v1
- Date: Mon, 16 Jun 2025 22:15:58 GMT
- Title: SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
- Authors: Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
- Abstract summary: Document Visual Question Answering (DocVQA) is a practical yet challenging task. Recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline. We introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA.
- Score: 17.272061289197342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document Visual Question Answering (DocVQA) is a practical yet challenging task that asks questions over documents spanning multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but use Visual Language Model (VLM)-based embedding models to embed and retrieve relevant pages as images, and generate answers with VLMs that accept images as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It improves evidence-page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average across 4 DocVQA datasets while retrieving far fewer pages. Our code is available at https://github.com/ag2ai/SimpleDoc.
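A minimal sketch of the pipeline as the abstract describes it, dual-cue retrieval feeding an iterative VLM reasoner; the `embed`, `rerank_by_summary`, and `vlm_answer` callables below are hypothetical stand-ins for the authors' components, not their actual interfaces:

```python
# Hypothetical sketch of SimpleDoc's dual-cue retrieval + iterative reasoning,
# reconstructed from the abstract alone. The callables passed in are assumed
# stand-ins, not the authors' API.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dual_cue_retrieve(question, pages, embed, rerank_by_summary, k=10, m=3):
    """Cue 1: embedding similarity shortlists k candidate pages.
    Cue 2: page summaries filter and re-rank the shortlist down to m pages."""
    q_vec = embed(question)
    shortlist = sorted(pages, key=lambda p: cosine(q_vec, p["embedding"]),
                       reverse=True)[:k]
    return rerank_by_summary(question, shortlist)[:m]

def simpledoc_answer(question, pages, embed, rerank_by_summary, vlm_answer,
                     max_rounds=3):
    """A single VLM reasoner agent repeatedly invokes the dual-cue retriever,
    pulling fresh pages into working memory until it is confident."""
    memory = {}  # working memory: page_id -> page record
    answer = None
    for _ in range(max_rounds):
        for page in dual_cue_retrieve(question, pages, embed, rerank_by_summary):
            memory.setdefault(page["page_id"], page)  # keep only fresh pages
        answer, confident = vlm_answer(question, list(memory.values()))
        if confident:
            break
    return answer
```

The design point the abstract highlights is that the reasoner agent, not a fixed top-k cutoff, decides when enough evidence has accumulated.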
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
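The summary states the token budget (324 per page) but not the compression mechanism; the following is a generic illustration of one common way to collapse a long patch sequence to a fixed budget, cross-attention from learned queries, and is not claimed to be DocOwl2's actual DocCompressor:

```python
# Generic illustration only: the abstract says each high-resolution page is
# compressed to 324 tokens but does not describe how. One common recipe is
# cross-attention from a fixed set of learned query tokens.
import torch
import torch.nn as nn

class FixedBudgetCompressor(nn.Module):
    def __init__(self, dim=1024, budget=324, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(budget, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens):  # (batch, n_patches, dim)
        b = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        return out  # (batch, 324, dim), regardless of n_patches

# e.g. a 2048-patch page collapses to a 324-token representation:
tokens = FixedBudgetCompressor()(torch.randn(1, 2048, 1024))
print(tokens.shape)  # torch.Size([1, 324, 1024])
```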
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- Large Language Models are Built-in Autoregressive Search Engines [19.928494069013485]
Large language models (LLMs) can follow human instructions to directly generate URLs for document retrieval.
LLMs can generate Web URLs where nearly 90% of the corresponding documents contain correct answers to open-domain questions.
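A hedged sketch of the generate-URLs-then-retrieve idea; `llm` and `fetch` are assumed placeholders for a text-completion model and a document downloader, not the paper's setup:

```python
# Hypothetical sketch: prompt an LLM for candidate URLs, then fetch the
# corresponding documents. `llm` and `fetch` are assumed callables.
import re

URL_RE = re.compile(r"https?://\S+")

def retrieve_by_generated_urls(question, llm, fetch, n_urls=3):
    prompt = (f"Question: {question}\n"
              f"List {n_urls} Web URLs whose pages answer this question:\n")
    completion = llm(prompt)
    urls = URL_RE.findall(completion)[:n_urls]  # parse URLs out of free text
    return [fetch(u) for u in urls]             # download each candidate doc
```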
arXiv Detail & Related papers (2023-05-16T17:04:48Z)
- Hierarchical multimodal transformers for Multi-Page DocVQA [9.115927248875566]
Existing work on DocVQA only considers single-page documents.
In this work we extend DocVQA to the multi-page scenario.
We propose a new hierarchical method, Hi-VT5, that overcomes the limitations of current methods to process long multi-page documents.
arXiv Detail & Related papers (2022-12-07T10:09:49Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
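The two-step pattern is simple enough to sketch; `llm` below is an assumed stand-in for any instruction-following model, and the prompts are illustrative only:

```python
# Minimal sketch of the generate-then-read pattern as the summary describes
# it; `llm` is an assumed text-generation callable.
def generate_then_read(question, llm, n_docs=3):
    # Step 1 (generate): prompt the LLM for contextual documents.
    docs = [llm(f"Generate a background document that answers: {question}")
            for _ in range(n_docs)]
    # Step 2 (read): condition the final answer on the generated documents.
    context = "\n\n".join(docs)
    return llm(f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:")
```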
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
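One plausible reading of multi-view retrieval, scoring a query against its best-matching view, sketched below under the assumption that each document already carries one embedding per generated pseudo-query:

```python
# Illustration of multi-view scoring with pseudo-query-informed embeddings.
# The max-over-views rule is an assumption, not necessarily the paper's.
import numpy as np

def multi_view_score(query_vec, doc_views):
    """doc_views: (n_views, dim) query-informed embeddings of one document."""
    sims = doc_views @ query_vec / (
        np.linalg.norm(doc_views, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return float(sims.max())  # a query matches its closest view

def rank_documents(query_vec, docs):
    return sorted(docs, key=lambda d: multi_view_score(query_vec, d["views"]),
                  reverse=True)
```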
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings to represent documents and enforce them to align with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
- End-to-End Multihop Retrieval for Compositional Question Answering over Long Documents [93.55268936974971]
We propose a multi-hop retrieval method, DocHopper, to answer compositional questions over long documents.
At each step, DocHopper retrieves a paragraph or sentence embedding from the document, mixes the retrieved result with the query, and updates the query for the next step.
We demonstrate that utilizing document structure in this way can largely improve question-answering and retrieval performance on long documents.
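A sketch of that retrieve-mix-update loop, under a simple additive query-update rule that is an assumption rather than DocHopper's exact mixing function:

```python
# Sketch of the iterative hop loop the summary describes: retrieve the
# closest paragraph/sentence embedding, mix it into the query, repeat.
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-9)

def dochopper_style_hops(query_vec, item_vecs, hops=3, mix=0.5):
    """item_vecs: (n_items, dim) paragraph/sentence embeddings of a document."""
    q, trail = normalize(query_vec), []
    for _ in range(hops):
        best = int(np.argmax(item_vecs @ q))  # retrieve the closest item
        trail.append(best)
        q = normalize((1 - mix) * q + mix * item_vecs[best])  # update the query
    return trail  # indices of retrieved evidence, one per hop
```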
arXiv Detail & Related papers (2021-06-01T03:13:35Z)