DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
- URL: http://arxiv.org/abs/2206.01062v1
- Date: Thu, 2 Jun 2022 14:25:12 GMT
- Title: DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
- Authors: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, Peter W J Staar
- Abstract summary: Document layout analysis is a key requirement for high-quality PDF document conversion.
Deep-learning models have proven to be very effective at layout detection and segmentation.
We present DocLayNet, a new, publicly available document-layout annotation dataset.
- Score: 2.9923891863939938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate document layout analysis is a key requirement for high-quality PDF
document conversion. With the recent availability of public, large ground-truth
datasets such as PubLayNet and DocBank, deep-learning models have proven to be
very effective at layout detection and segmentation. While these datasets are
of adequate size to train such models, they severely lack in layout variability
since they are sourced from scientific article repositories such as PubMed and
arXiv only. Consequently, the accuracy of the layout segmentation drops
significantly when these models are applied on more challenging and diverse
layouts. In this paper, we present DocLayNet, a new, publicly
available, document-layout annotation dataset in COCO format. It contains 80863
manually annotated pages from diverse data sources to represent a wide
variability in layouts. For each PDF page, the layout annotations provide
labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also
provides a subset of double- and triple-annotated pages to determine the
inter-annotator agreement. In multiple experiments, we provide baseline
accuracy scores (in mAP) for a set of popular object detection models. We also
demonstrate that these models fall approximately 10% behind the
inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is
of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and
DocLayNet, showing that layout predictions of the DocLayNet-trained models are
more robust and thus the preferred choice for general-purpose document-layout
analysis.
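The abstract states that DocLayNet ships its labelled bounding boxes in COCO format. As a minimal sketch of working with such annotations, the snippet below parses a toy in-memory dictionary with the standard COCO top-level structure (the category names, image names, and box values are illustrative stand-ins, not actual DocLayNet records; the real dataset provides a much larger JSON file with the same shape, loadable via `json.load`):

```python
from collections import Counter

# Toy stand-in for a COCO-format annotation file: the real DocLayNet JSON
# has the same "categories" / "images" / "annotations" top-level keys.
coco = {
    "categories": [{"id": 1, "name": "Text"}, {"id": 2, "name": "Table"}],
    "images": [{"id": 10, "file_name": "page_0001.png"}],
    "annotations": [
        {"image_id": 10, "category_id": 1, "bbox": [72.0, 90.5, 450.0, 30.0]},
        {"image_id": 10, "category_id": 2, "bbox": [72.0, 140.0, 450.0, 200.0]},
        {"image_id": 10, "category_id": 1, "bbox": [72.0, 360.0, 450.0, 28.0]},
    ],
}

def class_histogram(coco_dict):
    """Count annotations per layout class (COCO bbox format: [x, y, w, h])."""
    id_to_name = {c["id"]: c["name"] for c in coco_dict["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco_dict["annotations"])

hist = class_histogram(coco)
print(hist)  # Counter({'Text': 2, 'Table': 1})
```

On the full dataset, the same histogram taken over all 11 classes is a quick way to inspect the label distribution before training a detector.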
Related papers
- Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents [31.434507306952458]
We propose KNN-former, which incorporates a new kind of bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities.
We also use spatial matching to address the one-to-one mapping property that exists in many documents.
Our method is highly-efficient compared to existing approaches in terms of the number of trainable parameters.
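The core idea above, biasing attention toward a token's K nearest spatial neighbours, can be sketched in a few lines. This is a hypothetical pure-Python illustration, not the paper's actual formulation; the function name, the additive-bias scheme, and the penalty value are all assumptions:

```python
import math

def knn_attention_bias(centers, k, penalty=-1e9):
    """Build an additive attention-bias matrix from a K-nearest-neighbour
    graph over entity positions: pairs outside a token's KNN set receive a
    large negative bias, so softmax attention concentrates on neighbours."""
    n = len(centers)
    bias = [[penalty] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(centers):
        order = sorted(
            range(n),
            key=lambda j: math.hypot(centers[j][0] - xi, centers[j][1] - yi),
        )
        for j in order[: k + 1]:  # self is nearest, so keep k + 1 indices
            bias[i][j] = 0.0
    return bias

# Four entity centres on a page; with k=1 each entity attends only to
# itself and its single nearest neighbour.
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bias = knn_attention_bias(centers, k=1)
```

Adding such a matrix to the attention logits before the softmax masks out far-apart entity pairs without introducing any trainable parameters.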
arXiv Detail & Related papers (2024-05-08T10:10:38Z) - RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization [36.973388673687815]
RanLayNet is a synthetic document dataset enriched with automatically assigned labels.
We show that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents.
arXiv Detail & Related papers (2024-04-15T07:50:15Z) - Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents [54.744701806413204]
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.
We test whether layout-infused LMs are robust to layout distribution shifts.
arXiv Detail & Related papers (2023-06-01T18:01:33Z) - GVdoc: Graph-based Visual Document Classification [17.350393956461783]
We propose GVdoc, a graph-based document classification model.
Our approach generates a document graph based on its layout, and then trains a graph neural network to learn node and graph embeddings.
We show that our model, even with fewer parameters, outperforms state-of-the-art models on out-of-distribution data.
arXiv Detail & Related papers (2023-05-26T19:23:20Z) - XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z) - Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z) - Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements.
We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z) - DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.