Related papers: Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis

URL: http://arxiv.org/abs/2411.07138v1
Date: Mon, 11 Nov 2024 17:08:40 GMT
Title: Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis
Authors: Martin Mayr, Julian Krenz, Katharina Neumeier, Anna Bub, Simon Bürcky, Nina Brolich, Klaus Herbers, Mechthild Habermann, Peter Fleischmann, Andreas Maier, Vincent Christlein
Abstract summary: The Nuremberg Letterbooks dataset comprises historical documents from the early 15th century. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes.
Score: 4.660229623034816
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most datasets in the field of document analysis utilize highly standardized labels, which, while simplifying specific tasks, often produce outputs that are not directly applicable to humanities research. In contrast, the Nuremberg Letterbooks dataset, which comprises historical documents from the early 15th century, addresses this gap by providing multiple types of transcriptions and accompanying metadata. This approach allows for developing methods that are more closely aligned with the needs of the humanities. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes. Three types of transcriptions are provided for handwritten text recognition: basic, diplomatic, and regularized. For the latter two, versions with and without expanded abbreviations are also available. A combination of letter ID and writer ID supports writer identification due to changing writers within pages. In the technical validation, we established baselines for various tasks, demonstrating data consistency and providing benchmarks for future research to build upon.

Related papers

DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text [1.299941371793082]
We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.
arXiv Detail & Related papers (2026-02-20T09:25:14Z)
UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters [55.34921520578968]
Vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. We propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents.
arXiv Detail & Related papers (2025-12-24T10:35:21Z)
Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition [5.28595286827031]
The Manuscripts of Handwritten Arabic (Muharaf) dataset is a machine learning dataset consisting of more than 1,600 historic handwritten page images. This dataset was compiled to advance the state of the art in handwritten text recognition.
arXiv Detail & Related papers (2024-06-13T23:40:34Z)
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification [19.021909090693505]
This paper provides a new database for benchmarking script identification algorithms. The dataset consists of 1,135 documents scanned from local newspapers and handwritten letters, as well as notes from different native writers. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
arXiv Detail & Related papers (2024-05-29T09:29:09Z)
DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing. Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity. We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources. Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it as a promising tool in further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z)
Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels. We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration. The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract information. This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito" [1.4823641127537543]
This paper aims to create a novel dataset "Razmecheno", gathered from the diary texts of the project "Prozhito" in Russian. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during the Perestroika.
arXiv Detail & Related papers (2022-01-24T23:06:01Z)
SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting the semantic markup of digital editions as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues. A main challenge is that a person often writes a letter in different styles from time to time. We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI. We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.