VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
- URL: http://arxiv.org/abs/2504.09795v1
- Date: Mon, 14 Apr 2025 01:50:33 GMT
- Title: VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
- Authors: Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
- Abstract summary: We introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format. We also introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets.
- Score: 30.012487475552575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
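To make the retrieval half of such a pipeline concrete, the sketch below indexes document pages directly as images and retrieves them for a question. It is a minimal, hypothetical illustration: CLIP stands in for the dual encoder, and the file names and question are made up; VDocRAG itself adapts a large vision-language model with its own pre-training tasks.

```python
# Hypothetical sketch of image-based retrieval for a RAG pipeline over documents:
# pages are embedded directly as images (no OCR or text parsing), and the
# best-matching pages are retrieved for a question. CLIP is only a stand-in
# dual encoder here, not the authors' model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_pages(page_paths):
    """Encode each page image into a single normalized dense vector."""
    images = [Image.open(p).convert("RGB") for p in page_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(question, page_paths, page_embs, k=3):
    """Return the k page paths whose image embeddings best match the question."""
    inputs = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = torch.nn.functional.normalize(q, dim=-1)
    scores = (q @ page_embs.T).squeeze(0)              # cosine similarities
    top = scores.topk(min(k, len(page_paths))).indices.tolist()
    return [page_paths[i] for i in top]

# Hypothetical page images; the retrieved pages (as images, not extracted text)
# would then be passed with the question to a vision-language model for answering.
pages = ["report_p1.png", "slides_p2.png", "report_p3.png"]
hits = retrieve("Which region had the highest revenue?", pages, embed_pages(pages))
```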
Related papers
- Relation-Rich Visual Document Generator for Visual Information Extraction [12.4941229258054]
We propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach.
Our method significantly enhances the performance of document understanding models on various VIE benchmarks.
arXiv Detail & Related papers (2025-04-14T19:19:26Z)
- BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations [2.9798896492745537]
We present a unified dataset for document Question-Answering (QA). We reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task. Additionally, we release the OCR of all the documents and include the exact position of the answer, given as a bounding box in the document image.
arXiv Detail & Related papers (2025-01-06T21:46:22Z)
- VISA: Retrieval Augmented Generation with Visual Source Attribution [100.78278689901593]
Existing approaches in RAG primarily link generated content to document-level references. We propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain.
arXiv Detail & Related papers (2024-12-19T02:17:35Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation. We introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z)
- ColPali: Efficient Document Retrieval with Vision Language Models [15.369861972085136]
We introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept: doing document retrieval by directly embedding the images of the document pages. We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages.
arXiv Detail & Related papers (2024-06-27T15:45:29Z)
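As a rough illustration of the multi-vector retrieval idea behind ColPali, the toy snippet below scores pages with late interaction (MaxSim): each query-token embedding is matched against its best page-patch embedding and the maxima are summed. Random tensors stand in for embeddings a real vision-language encoder would produce; this is not ColPali's released code.

```python
# Toy illustration of late-interaction (MaxSim) scoring over multi-vector
# embeddings, the retrieval mechanism used by ColPali-style models. Random
# tensors stand in for real query-token and page-patch embeddings.
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); page_emb: (num_page_patches, dim)
    sims = query_emb @ page_emb.T            # token-to-patch similarity matrix
    return sims.max(dim=1).values.sum()      # best patch per query token, summed

query = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
pages = [torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1) for _ in range(5)]
scores = torch.stack([maxsim_score(query, p) for p in pages])
best_page = int(scores.argmax())             # index of the highest-scoring page
```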
- A Survey of Generative Information Retrieval [25.1249210843116]
Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking.
This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges.
arXiv Detail & Related papers (2024-06-03T10:59:33Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
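The generate-then-read recipe above fits in a few lines. In the hypothetical sketch below, `llm` is a placeholder for any instruction-following language model callable (not part of the GenRead release): it is first prompted to generate background documents for the question, then prompted again to answer from those generated documents.

```python
# Hypothetical sketch of the generate-then-read (GenRead) pattern: instead of
# retrieving documents, an LLM first generates context documents for the
# question and then answers from them. `llm` is a stand-in callable.
from typing import Callable

def generate_then_read(question: str, llm: Callable[[str], str], n_docs: int = 3) -> str:
    # Step 1: generate contextual documents conditioned on the question.
    docs = [
        llm(f"Generate a background document that helps answer this question:\n{question}")
        for _ in range(n_docs)
    ]
    # Step 2: read the generated documents and produce the final answer.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```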
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.