AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization
- URL: http://arxiv.org/abs/2503.22526v1
- Date: Fri, 28 Mar 2025 15:30:42 GMT
- Title: AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization
- Authors: Martin Kišš, Michal Hradiš, Martina Dvořáková, Václav Jiroušek, Filip Kersch
- Abstract summary: The AnnoPage dataset is a collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present. The dataset is designed to support research in document layout analysis and object detection.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the AnnoPage Dataset, a novel collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, focusing on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) representing elements of 25 categories of non-textual elements, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth annotations in YOLO format.
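The abstract notes that ground-truth annotations are distributed in YOLO format, where each box is stored as a class ID plus center coordinates and dimensions normalized to [0, 1]. A minimal sketch of converting such a line into an axis-aligned bounding box (AABB) in pixel coordinates, assuming the standard YOLO label layout (the category IDs shown are hypothetical, not the dataset's actual mapping):

```python
# Sketch: convert one YOLO-format label line into an axis-aligned
# bounding box (AABB) in pixel coordinates. A YOLO line stores
# "class_id x_center y_center width height", all normalized to [0, 1].

def yolo_to_aabb(line: str, img_w: int, img_h: int):
    """Return (class_id, x_min, y_min, x_max, y_max) in pixels."""
    cls, xc, yc, w, h = line.split()
    # Scale normalized values to the page's pixel dimensions.
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # Convert center/size to corner coordinates.
    return int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

# Example: a box centered on a hypothetical 1000x800 page scan
print(yolo_to_aabb("3 0.5 0.5 0.2 0.25", 1000, 800))
# → (3, 400.0, 300.0, 600.0, 500.0)
```

The same conversion is what detectors such as YOLO and DETR evaluation scripts typically perform internally before computing overlap metrics.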
Related papers
- BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction [0.0]
BiblioPage is a dataset of scanned title pages annotated with structured metadata. The dataset consists of approximately 2,000 title pages collected from 14 Czech libraries. We evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59.
arXiv Detail & Related papers (2025-03-25T13:46:55Z) - TextBite: A Historical Czech Document Dataset for Logical Page Segmentation [0.0]
Previous approaches have relied on OCR or precise geometry to define logical segmentation. To avoid the need for OCR, we define the task purely as segmentation in the image domain. We introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text.
arXiv Detail & Related papers (2025-03-20T19:19:12Z) - Diachronic Document Dataset for Semantic Layout Analysis [9.145289299764991]
This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials.
By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure.
We evaluate object detection models on this dataset, examining the impact of input size and subset-based training.
arXiv Detail & Related papers (2024-11-15T09:33:13Z) - DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it as a promising tool in further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z) - OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - A Large Dataset of Historical Japanese Documents with Complex Layouts [5.343406649012619]
HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
arXiv Detail & Related papers (2020-04-18T18:38:25Z) - Learning to Summarize Passages: Mining Passage-Summary Pairs from Wikipedia Revision Histories [110.54963847339775]
We propose a method for automatically constructing a passage-to-summary dataset by mining the Wikipedia page revision histories.
In particular, the method mines the main body passages and the introduction sentences which are added to the pages simultaneously.
The constructed dataset contains more than one hundred thousand passage-summary pairs.
arXiv Detail & Related papers (2020-04-06T12:11:50Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.