A Survey of Historical Document Image Datasets
- URL: http://arxiv.org/abs/2203.08504v1
- Date: Wed, 16 Mar 2022 09:56:48 GMT
- Title: A Survey of Historical Document Image Datasets
- Authors: Konstantina Nikolaidou, Mathias Seuret, Hamam Mokayed, Marcus Liwicki
- Abstract summary: This paper presents a systematic literature review of image datasets for document image analysis.
It focuses on historical documents, such as handwritten manuscripts and early prints.
Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms.
- Score: 2.8707038627097226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a systematic literature review of image datasets for
document image analysis, focusing on historical documents, such as handwritten
manuscripts and early prints. Finding appropriate datasets for historical
document analysis is a crucial prerequisite to facilitate research using
different machine learning algorithms. However, because of the very large
variety of the actual data (e.g., scripts, tasks, dates, support systems, and
amount of deterioration), the different formats for data and label
representation, and the different evaluation processes and benchmarks, finding
appropriate datasets is a difficult task. This work fills this gap, presenting
a meta-study on existing datasets. After a systematic selection process
(according to PRISMA guidelines), we select 56 studies based on
different factors, such as the year of publication, number of methods
implemented in the article, reliability of the chosen algorithms, dataset size,
and journal outlet. We summarize each study by assigning it to one of three
pre-defined tasks: document classification, layout structure, or semantic
analysis. We present the statistics, document type, language, tasks, input
visual aspects, and ground truth information for every dataset. In addition, we
provide the benchmark tasks and results from these papers or recent
competitions. We further discuss gaps and challenges in this domain. We
advocate for providing conversion tools to common formats (e.g., COCO format
for computer vision tasks) and always providing a set of evaluation metrics,
instead of just one, to make results comparable across studies.
Related papers
- Masked Image Modeling: A Survey [73.21154550957898]
Masked image modeling emerged as a powerful self-supervised learning technique in computer vision.
We construct a taxonomy and review the most prominent papers in recent years.
We aggregate the performance results of various masked image modeling methods on the most popular datasets.
arXiv Detail & Related papers (2024-08-13T07:27:02Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities.
We propose a novel, computer-aided, segmentation pipeline in order to alleviate the burden represented by the time-consuming process of manual annotation.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
- Beyond Document Page Classification: Design, Datasets, and Challenges [32.94494070330065]
This paper highlights the need to bring document classification benchmarking closer to real-world applications.
We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations.
arXiv Detail & Related papers (2023-08-24T16:16:47Z)
- A Generic Image Retrieval Method for Date Estimation of Historical Document Collections [0.4588028371034407]
This paper presents a robust date estimation system based on a retrieval approach that generalizes well across heterogeneous collections.
We use a ranking loss function named smooth-nDCG to train a Convolutional Neural Network that learns an ordination of documents for each problem.
arXiv Detail & Related papers (2022-04-08T12:30:39Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy measured against various baselines and diverse datasets.
Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)