MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
- URL: http://arxiv.org/abs/2511.09919v1
- Date: Fri, 14 Nov 2025 01:18:47 GMT
- Title: MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
- Authors: Ketong Chen, Yuhao Chen, Yang Xue
- Abstract summary: MosaicDoc is a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of Visually Rich Document Understanding (VRDU). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity.
- Score: 7.650139800950797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
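MosaicDoc's multi-task annotations (OCR, VQA, reading order, localization) suggest a natural per-page record format. The sketch below shows one plausible way to represent and iterate over such samples in Python; the field names and layout are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class MosaicDocSample:
    """Hypothetical record for one MosaicDoc page; all fields are assumptions."""
    image_path: str                           # scanned newspaper/magazine page
    language: str                             # "zh" or "en"
    ocr_lines: list[tuple[str, list[float]]]  # (text, [x0, y0, x1, y1]) per line
    reading_order: list[int]                  # permutation of ocr_lines indices
    qa_pairs: list[dict]                      # {"question", "answer", "bbox"}

def iter_vqa(samples: list[MosaicDocSample]):
    """Yield (image, question, answer) triples for the VQA sub-task."""
    for s in samples:
        for qa in s.qa_pairs:
            yield s.image_path, qa["question"], qa["answer"]
```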
Related papers
- SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents [10.146296597660598]
Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. We introduce SDS KoPub VDR, the first large-scale, public benchmark for retrieving and understanding Korean public documents. The benchmark is built upon 361 real-world documents, including 256 files under the KOGL Type 1 license and 105 from official legal portals.
arXiv Detail & Related papers (2025-11-07T01:16:07Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
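The fusion finding above suggests a simple late-fusion baseline: score candidates with text and image similarity separately, then mix the scores. A minimal sketch assuming precomputed, cosine-comparable embeddings; the encoders and the mixing weight `alpha` are placeholders, not the paper's method.

```python
import numpy as np

def fused_scores(q_txt, q_img, d_txt, d_img, alpha: float = 0.5) -> np.ndarray:
    """Late-fusion retrieval scores: weighted sum of text and image cosine
    similarities. q_*: (dim,) query embeddings; d_*: (n_docs, dim) document
    embeddings. alpha is an illustrative weight, not a value from the paper."""
    def cos(q, d):
        return (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
    return alpha * cos(q_txt, d_txt) + (1 - alpha) * cos(q_img, d_img)

# Rank candidate pages, best first:
# ranking = np.argsort(-fused_scores(q_txt, q_img, d_txt, d_img))
```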
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
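Benchmarks like this typically report retrieval quality as recall@k over gold evidence pages. A minimal sketch of that metric, assuming each question comes with a ranked page list and a gold page set (the data layout is an assumption):

```python
def recall_at_k(ranked_pages: list[list[int]],
                gold_pages: list[set[int]], k: int = 5) -> float:
    """Fraction of questions whose top-k retrieved pages hit any gold page."""
    hits = sum(bool(set(ranked[:k]) & gold)
               for ranked, gold in zip(ranked_pages, gold_pages))
    return hits / len(gold_pages)
```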
- DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents [25.190790899297788]
DocRefine is a framework for intelligent understanding, content refinement, and automated summarization of scientific PDF documents. It orchestrates a multi-agent system of six specialized, collaborative agents and consistently outperforms state-of-the-art baselines across various tasks.
arXiv Detail & Related papers (2025-08-09T15:32:52Z)
- Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? [16.717935491483146]
Double-Bench is a large-scale, multilingual, and multimodal evaluation system. It produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages.
arXiv Detail & Related papers (2025-08-05T16:55:02Z)
- HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark [31.753044906301664]
This article presents HW-MLVQA, a VQA benchmark designed to address the scarcity of resources for authentic handwritten document comprehension. It provides a robust evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality.
arXiv Detail & Related papers (2025-07-21T14:16:44Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [74.22413337117617]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
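Grounding on text-rich images is commonly scored by intersection-over-union (IoU) between predicted and annotated boxes. A minimal sketch of that convention; the 0.5 acceptance threshold is a common default, not necessarily TRIG's exact protocol.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds: list[Box], golds: list[Box],
                       thr: float = 0.5) -> float:
    """Fraction of predicted boxes matching the gold box at IoU >= thr."""
    return sum(iou(p, g) >= thr for p, g in zip(preds, golds)) / len(golds)
```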
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion [9.198920557312865]
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more.
arXiv Detail & Related papers (2025-03-14T16:44:14Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
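The sparse-sampling idea can be approximated outside the model as query-conditioned page selection: embed the question and every page, keep only the most similar pages, and feed just those to the MLLM. A minimal sketch with placeholder embeddings; this is not PDF-WuKong's actual sampler.

```python
import numpy as np

def select_pages(query_emb: np.ndarray, page_embs: np.ndarray,
                 k: int = 4) -> list[int]:
    """Return indices of the k pages most similar to the query.

    query_emb: (dim,); page_embs: (n_pages, dim). The encoders producing
    these embeddings are assumed, not specified by the paper."""
    sims = (page_embs @ query_emb) / (
        np.linalg.norm(page_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return np.argsort(-sims)[:k].tolist()

# Only the selected pages are passed to the MLLM, keeping the prompt
# short regardless of the document's total length.
```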
- GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification [8.880856137902947]
We introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner.
GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations.
For proper evaluation, we also propose two novel document-level downstream VDU tasks: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
arXiv Detail & Related papers (2023-09-11T18:35:14Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
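Benchmarks of this kind often score free-form model output by checking whether a reference answer occurs in the prediction. A minimal sketch of such a harness; the normalization and substring rule are illustrative assumptions, not OCRBench's exact protocol.

```python
def normalize(s: str) -> str:
    """Lowercase and collapse whitespace before matching."""
    return " ".join(s.lower().split())

def accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Count a prediction correct if any reference string occurs in it."""
    correct = sum(
        any(normalize(ref) in normalize(pred) for ref in refs)
        for pred, refs in zip(predictions, references))
    return correct / len(predictions)
```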
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.