Docopilot: Improving Multimodal Models for Document-Level Understanding
- URL: http://arxiv.org/abs/2507.14675v1
- Date: Sat, 19 Jul 2025 16:03:34 GMT
- Title: Docopilot: Improving Multimodal Models for Document-Level Understanding
- Authors: Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang
- Abstract summary: We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
- Score: 87.60020625241178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot
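Since the abstract contrasts a native document-level model with retrieval-based pipelines, a minimal sketch may help illustrate what "handling document-level dependencies without RAG" means at the input level: the whole multi-page document (page images interleaved with extracted text) is packed into a single prompt for one model call, rather than fetching fragments through a retriever. The directory layout (`page_*.png` / `page_*.txt`), the message format, and the function name below are hypothetical illustrations, not the released Docopilot API.

```python
# Minimal sketch, assuming pages are pre-rendered to page_*.png with matching
# page_*.txt transcripts (a hypothetical layout). The message format is
# illustrative, not the released Docopilot interface.
from pathlib import Path


def build_document_prompt(doc_dir: str, question: str) -> list[dict]:
    """Interleave every page image with its extracted text, then append the
    question, so cross-page dependencies sit in one context instead of
    being split across retrieved fragments."""
    messages: list[dict] = []
    for image_path in sorted(Path(doc_dir).glob("page_*.png")):
        messages.append({"type": "image", "path": str(image_path)})
        text_path = image_path.with_suffix(".txt")
        if text_path.exists():
            messages.append({"type": "text", "text": text_path.read_text()})
    # The question comes last, after the full document context; there is no
    # retrieval step that could fragment the context or accumulate errors.
    messages.append({"type": "text", "text": question})
    return messages


if __name__ == "__main__":
    prompt = build_document_prompt("sample_doc", "Summarize the findings across all sections.")
    print(f"Prepared {len(prompt)} interleaved segments for a single model call")
```

This trades the latency and fragmentation of retrieval for a long-context requirement on the model itself, which is the trade-off the abstract alludes to.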
Related papers
- Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? [16.717935491483146]
Double-Bench is a large-scale, multilingual, and multimodal evaluation system. It provides fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages.
arXiv Detail & Related papers (2025-08-05T16:55:02Z)
- Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation [4.875345207589195]
DocsRay is a training-free document understanding system. It integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2025-07-31T03:14:45Z)
- Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding [0.0]
Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering. Traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches (see the batching sketch after this list).
arXiv Detail & Related papers (2025-06-19T05:11:43Z)
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench). M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks [57.589795399265945]
We introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We also introduce BigDocs-Bench, a benchmark suite with 10 novel tasks. Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o.
arXiv Detail & Related papers (2024-12-05T21:41:20Z)
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework [75.95430061891828]
We introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models.
We propose a retrieval-aware tuning approach for efficient and effective multimodal document reading.
arXiv Detail & Related papers (2024-11-09T13:30:38Z)
- Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning [0.0]
Existing document understanding models tend to directly generate answers as a single word or phrase.
We use Multi-modal Large Language Models (MLLMs) to generate step-wise question-and-answer pairs for document images.
We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant.
arXiv Detail & Related papers (2024-02-26T01:17:50Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents, which have rich structure.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
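The Vision-Guided Chunking entry above describes feeding PDF pages to a Large Multimodal Model in batches so that chunks can respect content spanning page boundaries. The following is a minimal sketch of that batching idea only; `propose_chunk_boundaries` is a stand-in for any multimodal model call and the overlap strategy is an assumption, not that paper's actual method.

```python
# Minimal sketch of vision-guided batching for chunking (assumptions): page
# images are already rendered, and propose_chunk_boundaries wraps a call to
# some large multimodal model (not the API of the paper above).
from typing import Callable


def batch_pages(num_pages: int, batch_size: int = 4, overlap: int = 1) -> list[list[int]]:
    """Group page indices into overlapping batches so that content crossing a
    page boundary is visible to the model in at least one batch."""
    batches, start = [], 0
    step = max(batch_size - overlap, 1)
    while start < num_pages:
        batches.append(list(range(start, min(start + batch_size, num_pages))))
        start += step
    return batches


def chunk_document(num_pages: int,
                   propose_chunk_boundaries: Callable[[list[int]], list[int]]) -> set[int]:
    """Merge the boundary proposals from every batch into one set of chunk breaks."""
    boundaries: set[int] = set()
    for batch in batch_pages(num_pages):
        boundaries.update(propose_chunk_boundaries(batch))
    return boundaries


if __name__ == "__main__":
    # Placeholder "model": break after every second page in each batch.
    fake_lmm = lambda pages: pages[1::2]
    print(sorted(chunk_document(10, fake_lmm)))
```

Overlapping batches are one plausible way to keep cross-page context visible during chunking; the paper's actual prompting and merging strategy may differ.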