The Effects of Character-Level Data Augmentation on Style-Based Dating
of Historical Manuscripts
- URL: http://arxiv.org/abs/2212.07923v1
- Date: Thu, 15 Dec 2022 15:55:44 GMT
- Authors: Lisa Koopmans, Maruf A. Dhali and Lambert Schomaker
- Abstract summary: This article explores the influence of data augmentation on the dating of historical manuscripts.
Linear Support Vector Machines were trained with k-fold cross-validation on textural and grapheme-based features extracted from historical manuscripts.
Results show that training models with augmented data improves the performance of historical manuscript dating by 1% to 3% in cumulative scores.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Identifying the production dates of historical manuscripts is one of the main
goals for paleographers when studying ancient documents. Automatized methods
can provide paleographers with objective tools to estimate dates more
accurately. Previously, statistical features have been used to date digitized
historical manuscripts based on the hypothesis that handwriting styles change
over periods. However, the sparse availability of such documents poses a
challenge in obtaining robust systems. Hence, the research of this article
explores the influence of data augmentation on the dating of historical
manuscripts. Linear Support Vector Machines were trained with k-fold
cross-validation on textural and grapheme-based features extracted from
historical manuscripts of different collections, including the Medieval
Paleographical Scale, early Aramaic manuscripts, and the Dead Sea Scrolls.
Results show that training models with augmented data improves the performance
of historical manuscript dating by 1% to 3% in cumulative scores. Additionally,
this indicates further enhancement possibilities by considering models specific
to the features and the documents' scripts.
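The abstract reports gains in "cumulative scores" without defining the metric. A minimal sketch follows, assuming the definition commonly used in manuscript-dating work: the percentage of test samples whose absolute dating error is at most a tolerance of alpha years. The function names and the sample years are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(true_years: Sequence[float],
                        predicted_years: Sequence[float]) -> float:
    """Average absolute difference between true and predicted years."""
    if len(true_years) != len(predicted_years) or len(true_years) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_years, predicted_years))
    return total / len(true_years)


def cumulative_score(true_years: Sequence[float],
                     predicted_years: Sequence[float],
                     alpha: float) -> float:
    """Percentage of samples dated within +/- alpha years (assumed definition)."""
    if len(true_years) != len(predicted_years) or len(true_years) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true_years, predicted_years)
               if abs(t - p) <= alpha)
    return 100.0 * hits / len(true_years)


if __name__ == "__main__":
    # Hypothetical key years and predictions, for illustration only.
    true_years = [1300, 1325, 1350, 1375, 1400]
    predicted_years = [1300, 1350, 1350, 1425, 1375]
    print("MAE:", mean_absolute_error(true_years, predicted_years))
    for alpha in (0, 25, 50):
        print(f"CS(alpha={alpha}):",
              cumulative_score(true_years, predicted_years, alpha))
```

Under this assumed definition, a 1% to 3% improvement means that a slightly larger share of manuscripts falls inside the chosen tolerance window after training on augmented data.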
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
- Contrastive Entity Coreference and Disambiguation for Historical Texts [2.446672595462589]
Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledge bases.
This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts.
arXiv Detail & Related papers (2024-06-21T18:22:14Z) - PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z) - How to Choose Pretrained Handwriting Recognition Models for Single
Writer Fine-Tuning [23.274139396706264]
Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on modern and historical manuscripts.
Those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting.
In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model.
We give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as few as five real fine-tuning lines.
arXiv Detail & Related papers (2023-05-04T07:00:28Z) - Recognizing Handwriting Styles in a Historical Scanned Document Using
Unsupervised Fuzzy Clustering [0.0]
Unique handwriting styles may be dissimilar in a blend of several factors including character size, stroke width, loops, ductus, slant angles, and cursive ligatures.
Previous work on labeled data with Hidden Markov models, support vector machines, and semi-supervised recurrent neural networks has achieved moderate to high success.
In this study, we successfully detect hand shifts in a historical manuscript through fuzzy soft clustering in combination with linear principal component analysis.
arXiv Detail & Related papers (2022-10-30T09:07:51Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Augraphy: A Data Augmentation Library for Document Images [59.457999432618614]
Augraphy is a Python library for constructing data augmentation pipelines.
It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
arXiv Detail & Related papers (2022-08-30T22:36:19Z) - The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text
Recognition [40.20527158935902]
Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing.
We present the Ludovico Antonio Muratori dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years.
arXiv Detail & Related papers (2022-08-16T11:44:16Z) - A Generic Image Retrieval Method for Date Estimation of Historical
Document Collections [0.4588028371034407]
This paper presents a robust date estimation system based in a retrieval approach that generalizes well in front of heterogeneous collections.
We use a ranking loss function named smooth-nDCG to train a Convolutional Neural Network that learns an ordination of documents for each problem.
arXiv Detail & Related papers (2022-04-08T12:30:39Z) - Digital Editions as Distant Supervision for Layout Analysis of Printed
Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z) - Neural Language Modeling for Contextualized Temporal Graph Generation [49.21890450444187]
This paper presents the first study on using large-scale pre-trained language models for automated generation of an event-level temporal graph for a document.
arXiv Detail & Related papers (2020-10-20T07:08:00Z)