Historical Document Processing: A Survey of Techniques, Tools, and Trends
- URL: http://arxiv.org/abs/2002.06300v2
- Date: Fri, 11 Sep 2020 03:09:05 GMT
- Title: Historical Document Processing: A Survey of Techniques, Tools, and Trends
- Authors: James P. Philips and Nasseh Tabrizi
- Abstract summary: Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars.
It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Historical Document Processing is the process of digitizing written material
from the past for future use by historians and other scholars. It incorporates
algorithms and software tools from various subfields of computer science,
including computer vision, document analysis and recognition, natural language
processing, and machine learning, to convert images of ancient manuscripts,
letters, diaries, and early printed texts automatically into a digital format
usable in data mining and information retrieval systems. Within the past twenty
years, as libraries, museums, and other cultural heritage institutions have
scanned an increasing volume of their historical document archives, the need to
transcribe the full text from these collections has become acute. Since
Historical Document Processing encompasses multiple sub-domains of computer
science, knowledge relevant to its purpose is scattered across numerous
journals and conference proceedings. This paper surveys the major phases,
standard algorithms, tools, and datasets in the field of Historical Document
Processing, discusses the results of a literature review, and finally suggests
directions for further research.
Related papers
- PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- ScrollTimes: Tracing the Provenance of Paintings as a Window into History [35.605930297790465]
The study of cultural artifact provenance, tracing ownership and preservation, holds significant importance in archaeology and art history.
In collaboration with art historians, we examined the handscroll, a traditional Chinese painting form that provides a rich source of historical data.
We present a three-tiered methodology encompassing artifact, contextual, and provenance levels, designed to create a "Biography" for a handscroll.
arXiv Detail & Related papers (2023-06-15T03:38:09Z)
- Augraphy: A Data Augmentation Library for Document Images [59.457999432618614]
Augraphy is a Python library for constructing data augmentation pipelines.
It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
arXiv Detail & Related papers (2022-08-30T22:36:19Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Open Set Classification of Untranscribed Handwritten Documents [56.0167902098419]
Huge amounts of digital page images of important manuscripts are preserved in archives worldwide.
The class or "typology" of a document is perhaps the most important tag to be included in the metadata.
The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images.
arXiv Detail & Related papers (2022-06-20T20:43:50Z)
- A Survey of Historical Document Image Datasets [2.8707038627097226]
This paper presents a systematic literature review of image datasets for document image analysis.
It focuses on historical documents, such as handwritten manuscripts and early prints.
Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms.
arXiv Detail & Related papers (2022-03-16T09:56:48Z)
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting the semantic markup of digital editions as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
- Document AI: Benchmarks, Models and Applications [35.46858492311289]
Document AI refers to the techniques for automatically reading, understanding, and analyzing business documents.
In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI.
This paper briefly reviews some of the representative models, tasks, and benchmark datasets.
arXiv Detail & Related papers (2021-11-16T16:43:07Z)
- A Survey of Deep Learning Approaches for OCR and Document Understanding [68.65995739708525]
We review different techniques for document understanding for documents written in English.
We consolidate methodologies present in literature to act as a jumping-off point for researchers exploring this area.
arXiv Detail & Related papers (2020-11-27T03:05:59Z)
- Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z)