LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
- URL: http://arxiv.org/abs/2412.18424v2
- Date: Fri, 27 Dec 2024 08:33:31 GMT
- Title: LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
- Authors: Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
- Abstract summary: Large vision language models (LVLMs) have remarkably improved document understanding capabilities.
Existing document understanding benchmarks have been limited to handling only a small number of pages.
We develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents.
- Score: 40.44974704748952
- Abstract: Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents and significantly surpassing existing benchmarks in scale. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
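To make the benchmark's organization concrete, here is a minimal sketch of how LongDocURL-style QA pairs might be loaded and scored per primary task. The field names (`question_id`, `primary_task`, `answer`) and file names are illustrative assumptions, not the authors' released schema, and exact-match scoring stands in for whatever protocol the paper actually uses.

```python
import json
from collections import defaultdict


def load_qa_pairs(path):
    """Load benchmark QA pairs from a JSON Lines file (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def accuracy_by_task(qa_pairs, predictions):
    """Group exact-match accuracy by primary task (Understanding / Reasoning / Locating)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in qa_pairs:
        task = qa["primary_task"]            # e.g. "Understanding", "Reasoning", "Locating"
        pred = predictions.get(qa["question_id"], "")
        total[task] += 1
        if pred.strip().lower() == str(qa["answer"]).strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    qa_pairs = load_qa_pairs("longdocurl_qa.jsonl")           # hypothetical file name
    with open("model_predictions.json", encoding="utf-8") as f:
        predictions = json.load(f)                            # {question_id: answer_string}
    for task, acc in accuracy_by_task(qa_pairs, predictions).items():
        print(f"{task}: {acc:.3f}")
```

The same grouping extends to the 20 sub-tasks or to answer-evidence categories by keying on an additional field in each record.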
Related papers
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval.
The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
arXiv Detail & Related papers (2025-01-15T14:30:13Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.
We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these richly structured documents.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- VRDU: A Benchmark for Visually-rich Document Understanding [22.040372755535767]
We identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU).
VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as hierarchical entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type.
We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results.
arXiv Detail & Related papers (2022-11-15T03:17:07Z)
- Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)