Kleister: A novel task for Information Extraction involving Long
Documents with Complex Layout
- URL: http://arxiv.org/abs/2003.02356v2
- Date: Fri, 6 Mar 2020 18:51:54 GMT
- Title: Kleister: A novel task for Information Extraction involving Long
Documents with Complex Layout
- Authors: Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid
Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski,
Przemysław Biecek
- Abstract summary: We introduce a new task (named Kleister) with two new datasets.
An NLP system must find the most important information about various types of entities in long formal documents.
We propose a Pipeline method as a text-only baseline with different Named Entity Recognition architectures.
- Score: 5.8530995077744645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art solutions for Natural Language Processing (NLP) are able to
capture a broad range of contexts, like the sentence-level context or
document-level context for short documents. But these solutions are still
struggling when it comes to longer, real-world documents with the information
encoded in the spatial structure of the document, such as page elements like
tables, forms, headers, openings or footers; complex page layout or presence of
multiple pages.
To encourage progress on deeper and more complex Information Extraction (IE)
we introduce a new task (named Kleister) with two new datasets. Utilizing both
textual and structural layout features, an NLP system must find the most
important information about various types of entities in long formal
documents. We propose a Pipeline method as a text-only baseline with different
Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we
evaluated the most popular PDF processing tools for text extraction (pdf2djvu,
Tesseract, and Textract) in order to analyze the behavior of an IE system in the
presence of errors introduced by these tools.
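The Pipeline baseline described above chains text extraction, Named Entity Recognition, and entity selection. A minimal sketch of that shape, with stubbed stand-ins for each stage (all function names and the example data here are illustrative, not taken from the paper's code):

```python
# Sketch of a text-only extraction pipeline in the spirit of the Kleister
# baseline: extract text from a PDF, run NER, then choose one value per
# entity type. Every component is a hypothetical stand-in.
from collections import Counter

def extract_text(pdf_path):
    # Stand-in for pdf2djvu / Tesseract / Textract output.
    return "Acme Ltd. was incorporated in London. Acme Ltd. filed accounts."

def run_ner(text):
    # Stand-in for a Flair/BERT/RoBERTa tagger: (entity_type, surface form).
    return [("ORG", "Acme Ltd."), ("LOC", "London"), ("ORG", "Acme Ltd.")]

def select_entities(mentions):
    # One value per entity type, chosen by mention frequency (majority vote).
    by_type = {}
    for etype, surface in mentions:
        by_type.setdefault(etype, Counter())[surface] += 1
    return {etype: counts.most_common(1)[0][0]
            for etype, counts in by_type.items()}

def pipeline(pdf_path):
    return select_entities(run_ner(extract_text(pdf_path)))

print(pipeline("report.pdf"))  # {'ORG': 'Acme Ltd.', 'LOC': 'London'}
```

Because each stage is isolated, OCR errors from a given extraction tool can be studied by swapping only the first component, which is the kind of analysis the paper performs.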
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)
- WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data [13.297444760076406]
We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora.
WordScape parses the Open XML structure of Word documents obtained from the web.
It offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text.
arXiv Detail & Related papers (2023-12-15T20:28:31Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- PDFVQA: A New Dataset for Real-World VQA on PDF Documents [2.105395241374678]
Document-based Visual Question Answering examines the understanding of document images conditioned on natural language questions.
Our PDF-VQA dataset extends the current scope of document understanding, previously limited to a single document page, by asking questions over full documents spanning multiple pages.
arXiv Detail & Related papers (2023-04-13T12:28:14Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z)
- StrucTexT: Structured Text Understanding with Multi-Modal Transformers [29.540122964399046]
Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence.
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks.
We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts.
arXiv Detail & Related papers (2021-08-06T02:57:07Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.