READoc: A Unified Benchmark for Realistic Document Structured Extraction
- URL: http://arxiv.org/abs/2409.05137v2
- Date: Sun, 3 Nov 2024 09:09:21 GMT
- Title: READoc: A Unified Benchmark for Realistic Document Structured Extraction
- Authors: Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, Le Sun,
- Abstract summary: We introduce a novel benchmark named READoc, which defines DSE as a realistic task.
The READoc dataset is derived from 2,233 diverse and real-world documents from arXiv and GitHub.
In addition, we develop a unified evaluation of state-of-the-art DSE approaches.
- Score: 44.44722729958791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field's advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 2,233 diverse and real-world documents from arXiv and GitHub. In addition, we develop a DSE Evaluation S$^3$uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present textbfname (textbfInformation textbfRetrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
name comprises textbften meticulously curated code datasets, spanning textbfeight distinctive retrieval tasks across textbfseven diverse domains.
We evaluate nine widely used retrieval models using name, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis [7.952225508086861]
In academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats.
We introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs.
arXiv Detail & Related papers (2024-06-21T14:29:39Z) - LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach [9.643486775455841]
This paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration systems.
We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content.
We evaluate our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study.
arXiv Detail & Related papers (2024-06-12T19:41:01Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - Deconstructing Self-Supervised Monocular Reconstruction: The Design
Decisions that Matter [63.5550818034739]
This paper presents a framework to evaluate state-of-the-art contributions to self-supervised monocular depth estimation.
It includes pretraining, backbone, architectural design choices and loss functions.
We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset.
arXiv Detail & Related papers (2022-08-02T14:38:53Z) - Few-Shot Document-Level Relation Extraction [0.0]
We present document-level relation extraction benchmark (FSDLRE)
We argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions.
We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation.
arXiv Detail & Related papers (2022-05-04T13:16:19Z) - Document-Level Relation Extraction with Sentences Importance Estimation
and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z) - Cross-Domain Document Layout Analysis Using Document Style Guide [15.799572801059716]
Document layout analysis (DLA) aims to decompose document images into high-level semantic areas.
Many researchers devoted this challenge by synthesizing data to build large training sets.
In this paper, we propose an unsupervised cross-domain DLA framework based on document style guidance.
arXiv Detail & Related papers (2022-01-24T00:49:19Z) - Document-Level Definition Detection in Scholarly Documents: Existing
Models, Error Analyses, and Future Directions [40.64025648548128]
We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and filters, and evaluate it on a standard sentence-level benchmark.
HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively.
arXiv Detail & Related papers (2020-10-11T01:16:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.