WordScape: a Pipeline to extract multilingual, visually rich Documents
with Layout Annotations from Web Crawl Data
- URL: http://arxiv.org/abs/2312.10188v1
- Date: Fri, 15 Dec 2023 20:28:31 GMT
- Authors: Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov,
Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian Foster, Bo Li, Rick
Stevens, Ce Zhang
- Abstract summary: We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora.
WordScape parses the Open XML structure of Word documents obtained from the web.
It offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce WordScape, a novel pipeline for the creation of
cross-disciplinary, multilingual corpora comprising millions of pages with
annotations for document layout detection. Relating visual and textual items on
document pages has gained further significance with the advent of multimodal
models. Various approaches proved effective for visual question answering or
layout segmentation. However, the interplay of text, tables, and visuals
remains challenging for a variety of document understanding tasks. In
particular, many models fail to generalize well to diverse domains and new
languages due to insufficient availability of training data. WordScape
addresses these limitations. Our automatic annotation pipeline parses the Open
XML structure of Word documents obtained from the web, jointly providing
layout-annotated document images and their textual representations. In turn,
WordScape offers unique properties as it (1) leverages the ubiquity of the Word
file format on the internet, (2) is readily accessible through the Common Crawl
web corpus, (3) is adaptive to domain-specific documents, and (4) offers
culturally and linguistically diverse document pages with natural semantic
structure and high-quality text. Together with the pipeline, we will
additionally release 9.5M URLs to Word documents, which can be processed using
WordScape to create a dataset of over 40M pages. Finally, we investigate the
quality of text and layout annotations extracted by WordScape, assess the
impact on document understanding benchmarks, and demonstrate that manual
labeling costs can be substantially reduced.
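The abstract describes parsing the Open XML structure of Word documents. As a minimal sketch of that idea (not the actual WordScape code): a .docx file is a ZIP archive whose main content lives in word/document.xml, where `<w:p>` elements are paragraphs and `<w:tbl>` elements are tables, so layout categories and text can be read directly from the XML. The sample XML below is a hypothetical stand-in for a real document body.

```python
# Sketch of reading the Open XML body of a Word document. In a real
# pipeline the XML would come from the archive itself, e.g.
# zipfile.ZipFile(path).read("word/document.xml").
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, minimal word/document.xml content for illustration.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Section heading</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text of the page.</w:t></w:r></w:p>
    <w:tbl/>
  </w:body>
</w:document>"""

def extract_elements(xml_text):
    """Return (paragraph_texts, table_count) from a document.xml body."""
    root = ET.fromstring(xml_text)
    ns = {"w": W}
    # Each <w:p> is a paragraph; its text is the concatenation of <w:t> runs.
    paragraphs = ["".join(t.text or "" for t in p.iterfind(".//w:t", ns))
                  for p in root.iterfind(".//w:p", ns)]
    tables = len(root.findall(".//w:tbl", ns))
    return paragraphs, tables

paras, n_tables = extract_elements(DOCUMENT_XML)
print(paras)      # ['Section heading', 'Body text of the page.']
print(n_tables)   # 1
```

Because element types such as paragraphs, tables, and headings are explicit in the XML, layout annotations can be derived without manual labeling, which is what makes the format attractive for building annotated corpora at scale.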
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective of point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose a visually guided generative text-layout pre-training approach, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
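DocLLM's summary notes that the model uses only bounding-box information, not image features, to capture spatial layout. A schematic illustration of that input representation (my assumption of how OCR tokens might be paired with boxes, not DocLLM's actual code) is pairing each token with its box and normalizing coordinates into a fixed grid so pages of different sizes share one coordinate vocabulary:

```python
# Schematic: pair each OCR token with its bounding box, the only spatial
# signal a layout-aware text model of this kind consumes.
from dataclasses import dataclass

@dataclass
class LayoutToken:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page pixels

def normalize_bbox(bbox, page_w, page_h, bins=1000):
    """Scale pixel coordinates into a fixed [0, bins) grid so boxes from
    pages of different sizes share one discrete coordinate vocabulary."""
    x0, y0, x1, y1 = bbox
    return (int(x0 * bins / page_w), int(y0 * bins / page_h),
            int(x1 * bins / page_w), int(y1 * bins / page_h))

tok = LayoutToken("Invoice", (100, 50, 300, 90))
norm = normalize_bbox(tok.bbox, page_w=1000, page_h=800)
print(norm)  # (100, 62, 300, 112)
```

Dropping image pixels in favor of coordinates keeps the model lightweight, which is the trade-off the summary highlights.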
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
- A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually rich document understanding (Document AI) models in a previously untested setting: document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- PDFVQA: A New Dataset for Real-World VQA on PDF Documents [2.105395241374678]
Document-based Visual Question Answering examines the understanding of document images conditioned on natural language questions.
Our PDF-VQA dataset extends document understanding from the current single-page scale to questions asked over full documents spanning multiple pages.
arXiv Detail & Related papers (2023-04-13T12:28:14Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout [5.8530995077744645]
We introduce a new task (named Kleister) with two new datasets.
An NLP system must find the most important information about various types of entities in long formal documents.
We propose a pipeline method as a text-only baseline with different Named Entity Recognition architectures.
arXiv Detail & Related papers (2020-03-04T22:45:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.