MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
- URL: http://arxiv.org/abs/2511.09919v1
- Date: Fri, 14 Nov 2025 01:18:47 GMT
- Title: MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
- Authors: Ketong Chen, Yuhao Chen, Yang Xue
- Abstract summary: MosaicDoc is a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of Visually Rich Document Understanding (VRDU). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity.
- Score: 7.650139800950797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
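MosaicDoc's multi-task annotations (OCR, VQA, reading order, localization) suggest a natural per-page record format. The sketch below shows one plausible way to represent and iterate over such samples in Python; the field names and layout are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class MosaicDocSample:
    """Hypothetical record for one MosaicDoc page; all fields are assumptions."""
    image_path: str                           # scanned newspaper/magazine page
    language: str                             # "zh" or "en"
    ocr_lines: list[tuple[str, list[float]]]  # (text, [x0, y0, x1, y1]) per line
    reading_order: list[int]                  # permutation of ocr_lines indices
    qa_pairs: list[dict]                      # {"question", "answer", "bbox"}

def iter_vqa(samples: list[MosaicDocSample]):
    """Yield (image, question, answer) triples for the VQA sub-task."""
    for s in samples:
        for qa in s.qa_pairs:
            yield s.image_path, qa["question"], qa["answer"]
```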
Related papers
- SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents [10.146296597660598]
Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. We introduce SDS KoPub VDR, the first large-scale, public benchmark for retrieving and understanding Korean public documents. The benchmark is built upon 361 real-world documents, including 256 files under the KOGL Type 1 license and 105 from official legal portals.
arXiv Detail & Related papers (2025-11-07T01:16:07Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
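The fusion finding above suggests a simple late-fusion baseline: score candidates with text and image similarity separately, then mix the scores. A minimal sketch assuming precomputed, cosine-comparable embeddings; the encoders and the mixing weight `alpha` are placeholders, not the paper's method.

```python
import numpy as np

def fused_scores(q_txt, q_img, d_txt, d_img, alpha: float = 0.5) -> np.ndarray:
    """Late-fusion retrieval scores: weighted sum of text and image cosine
    similarities. q_*: (dim,) query embeddings; d_*: (n_docs, dim) document
    embeddings. alpha is an illustrative weight, not a value from the paper."""
    def cos(q, d):
        return (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
    return alpha * cos(q_txt, d_txt) + (1 - alpha) * cos(q_img, d_img)

# Rank candidate pages, best first:
# ranking = np.argsort(-fused_scores(q_txt, q_img, d_txt, d_img))
```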
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
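Benchmarks like this typically report retrieval quality as recall@k over gold evidence pages. A minimal sketch of that metric, assuming each question comes with a ranked page list and a gold page set (the data layout is an assumption):

```python
def recall_at_k(ranked_pages: list[list[int]],
                gold_pages: list[set[int]], k: int = 5) -> float:
    """Fraction of questions whose top-k retrieved pages hit any gold page."""
    hits = sum(bool(set(ranked[:k]) & gold)
               for ranked, gold in zip(ranked_pages, gold_pages))
    return hits / len(gold_pages)
```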
- DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents [25.190790899297788]
DocRefine is a framework for intelligent understanding, content refinement, and automated summarization of scientific PDF documents. It orchestrates a multi-agent system of six specialized, collaborative agents and consistently outperforms state-of-the-art baselines across various tasks.
arXiv Detail & Related papers (2025-08-09T15:32:52Z)
- Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? [16.717935491483146]
Double-Bench is a large-scale, multilingual, and multimodal evaluation system. It produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages.
arXiv Detail & Related papers (2025-08-05T16:55:02Z)
- HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark [31.753044906301664]
This article presents HW-MLVQA, a VQA benchmark designed to address the scarcity of resources for authentic handwritten document comprehension. It provides a robust evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality.
arXiv Detail & Related papers (2025-07-21T14:16:44Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [74.22413337117617]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
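Grounding on text-rich images is commonly scored by intersection-over-union (IoU) between predicted and annotated boxes. A minimal sketch of that convention; the 0.5 acceptance threshold is a common default, not necessarily TRIG's exact protocol.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds: list[Box], golds: list[Box],
                       thr: float = 0.5) -> float:
    """Fraction of predicted boxes matching the gold box at IoU >= thr."""
    return sum(iou(p, g) >= thr for p, g in zip(preds, golds)) / len(golds)
```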
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion [9.198920557312865]
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more.
arXiv Detail & Related papers (2025-03-14T16:44:14Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
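The sparse-sampling idea can be approximated outside the model as query-conditioned page selection: embed the question and every page, keep only the most similar pages, and feed just those to the MLLM. A minimal sketch with placeholder embeddings; this is not PDF-WuKong's actual sampler.

```python
import numpy as np

def select_pages(query_emb: np.ndarray, page_embs: np.ndarray,
                 k: int = 4) -> list[int]:
    """Return indices of the k pages most similar to the query.

    query_emb: (dim,); page_embs: (n_pages, dim). The encoders producing
    these embeddings are assumed, not specified by the paper."""
    sims = (page_embs @ query_emb) / (
        np.linalg.norm(page_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return np.argsort(-sims)[:k].tolist()

# Only the selected pages are passed to the MLLM, keeping the prompt
# short regardless of the document's total length.
```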
- GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification [8.880856137902947]
We introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner.
GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations.
For proper evaluation, we also propose two novel document-level downstream VDU tasks: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
arXiv Detail & Related papers (2023-09-11T18:35:14Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
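Benchmarks of this kind often score free-form model output by checking whether a reference answer occurs in the prediction. A minimal sketch of such a harness; the normalization and substring rule are illustrative assumptions, not OCRBench's exact protocol.

```python
def normalize(s: str) -> str:
    """Lowercase and collapse whitespace before matching."""
    return " ".join(s.lower().split())

def accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Count a prediction correct if any reference string occurs in it."""
    correct = sum(
        any(normalize(ref) in normalize(pred) for ref in refs)
        for pred, refs in zip(predictions, references))
    return correct / len(predictions)
```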
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.