Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs
- URL: http://arxiv.org/abs/2306.10046v2
- Date: Tue, 8 Aug 2023 09:46:21 GMT
- Title: Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs
- Authors: Alejandro Peña, Aythami Morales, Julian Fierrez, Javier
Ortega-Garcia, Marcos Grande, Iñigo Puente, Jorge Cordova, Gonzalo Cordova
- Abstract summary: We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
- Score: 62.38140271294419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Every day, thousands of digital documents are generated with useful
information for companies, public organizations, and citizens. Given the
impossibility of processing them manually, the automatic processing of these
documents is becoming increasingly necessary in certain sectors. However, this
task remains challenging, since in most cases a text-only based parsing is not
enough to fully understand the information presented through different
components of varying significance. In this regard, Document Layout Analysis
(DLA), which aims to detect and classify the basic components of a document,
has been an active research field for many years. In this work, we used a
procedure to semi-automatically annotate digital documents with different
layout labels, including 4 basic layout blocks and 4 text categories. We apply
this procedure to collect a novel database for DLA in the public affairs
domain, using a set of 24 data sources from the Spanish Administration. The
database comprises 37.9K documents with more than 441K document pages, and more
than 8M labels associated with 8 layout block units. The results of our
experiments validate the proposed text labeling procedure with accuracy up to
99%.
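The abstract describes an annotation scheme of 4 basic layout blocks plus 4 text categories (8 layout block units in total). As a minimal sketch of what such a per-region annotation record might look like, the Python below defines a hypothetical label set and a validated annotation class; the specific label names and the bounding-box format are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical DLA annotation record. The label names below are
# assumptions for illustration; the paper defines its own 8 units.
from dataclasses import dataclass

LAYOUT_BLOCKS = ["text", "table", "figure", "other"]        # 4 basic layout blocks (assumed names)
TEXT_CATEGORIES = ["title", "subtitle", "body", "caption"]  # 4 text categories (assumed names)
VALID_LABELS = set(LAYOUT_BLOCKS) | set(TEXT_CATEGORIES)    # 8 layout block units

@dataclass
class LayoutAnnotation:
    page: int                          # 1-based page index within the document
    bbox: tuple                        # (x0, y0, x1, y1) in page coordinates
    label: str                         # one of the 8 layout block units

    def __post_init__(self):
        # Reject labels outside the fixed scheme so downstream
        # training code only ever sees the 8 expected classes.
        if self.label not in VALID_LABELS:
            raise ValueError(f"unknown label: {self.label}")

# Example: annotate a title region on the first page of a document.
ann = LayoutAnnotation(page=1, bbox=(50, 100, 400, 160), label="title")
```

A fixed, validated label set like this is what makes the reported ~99% labeling accuracy measurable: every region carries exactly one of the 8 classes, so predicted and ground-truth labels can be compared directly.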
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z) - BuDDIE: A Business Document Dataset for Multi-task Information Extraction [18.440587946049845]
BuDDIE is the first multi-task dataset of 1,665 real-world business documents.
Our dataset consists of publicly available business entity documents from US state government websites.
arXiv Detail & Related papers (2024-04-05T10:26:42Z) - FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and
Understanding [8.855033708082832]
We introduce FATURA, a pivotal resource for researchers in the field of document analysis and understanding.
FATURA is a highly diverse dataset featuring multi-annotated invoice document images.
We provide comprehensive benchmarks for various document analysis and understanding tasks and conduct experiments under diverse training and evaluation scenarios.
arXiv Detail & Related papers (2023-11-20T15:51:14Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these richly structured documents.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset [1.2015699532079325]
This dataset contains 33,695 human annotated document samples from six domains.
We demonstrate the efficacy of our dataset in training deep learning based Bengali document models.
arXiv Detail & Related papers (2023-03-09T15:15:55Z) - DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z) - Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z) - MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document
Analysis [48.35030471041193]
MIDV-2020 consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents.
With 72,409 annotated images in total, the proposed dataset was, at the date of publication, the largest publicly available identity document dataset.
arXiv Detail & Related papers (2021-07-01T12:14:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.