Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis
- URL: http://arxiv.org/abs/2411.07138v1
- Date: Mon, 11 Nov 2024 17:08:40 GMT
- Title: Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis
- Authors: Martin Mayr, Julian Krenz, Katharina Neumeier, Anna Bub, Simon Bürcky, Nina Brolich, Klaus Herbers, Mechthild Habermann, Peter Fleischmann, Andreas Maier, Vincent Christlein,
- Abstract summary: The Nuremberg Letterbooks dataset comprises historical documents from the early 15th century.
The dataset includes 4 books containing 1711 labeled pages written by 10 scribes.
- Score: 4.660229623034816
- License:
- Abstract: Most datasets in the field of document analysis utilize highly standardized labels, which, while simplifying specific tasks, often produce outputs that are not directly applicable to humanities research. In contrast, the Nuremberg Letterbooks dataset, which comprises historical documents from the early 15th century, addresses this gap by providing multiple types of transcriptions and accompanying metadata. This approach allows for developing methods that are more closely aligned with the needs of the humanities. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes. Three types of transcriptions are provided for handwritten text recognition: Basic, diplomatic, and regularized. For the latter two, versions with and without expanded abbreviations are also available. A combination of letter ID and writer ID supports writer identification due to changing writers within pages. In the technical validation, we established baselines for various tasks, demonstrating data consistency and providing benchmarks for future research to build upon.
Related papers
- Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition [5.28595286827031]
The Manuscripts of Handwritten Arabic(Muharaf) dataset is a machine learning dataset consisting of more than 1,600 historic handwritten page images.
This dataset was compiled to advance the state of the art in handwritten text recognition.
arXiv Detail & Related papers (2024-06-13T23:40:34Z) - MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification [19.021909090693505]
This paper provides a new database for benchmarking script identification algorithms.
The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers.
Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
arXiv Detail & Related papers (2024-05-29T09:29:09Z) - DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it as a promising tool in further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z) - Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task emphDocument-Aware Passage Retrieval (DAPR)
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Razmecheno: Named Entity Recognition from Digital Archive of Diaries
"Prozhito" [1.4823641127537543]
This paper aims to create a novel dataset "Razmecheno", gathered from the diary texts of the project "Prozhito" in Russian.
Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries, written during the Perestroika.
arXiv Detail & Related papers (2022-01-24T23:06:01Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - Digital Editions as Distant Supervision for Layout Analysis of Printed
Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z) - Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues.
A main challenge is that a person often writes a letter in different styles from time to time.
We refer to this problem as the variance of online writing styles (Var-O-Styles)
arXiv Detail & Related papers (2021-12-06T07:21:53Z) - Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.